Regarding locality, it's not just Lars' stuff, it's in the RefGuide (see section 9.7.3):
http://hbase.apache.org/book.html#regions.arch

Re: "You will still be reading/writing over the network": this is
definitely true as far as writes go, because of the replicas (see the
RefGuide for why), although I disagree on the read portion unless there
is an exceptional case (which is typically the result of an RS going
down).

On 6/6/12 4:27 PM, "Atif Khan" <[email protected]> wrote:

>Thanks Amandeep!
>
>I think what I was saying is that we are trying to support both types
>of workloads: real-time transactional workloads, and batch processing
>for data analysis. The big question is whether a single HDFS cluster
>should be shared between the two workloads.
>
>The point that you are trying to make (if I am understanding you
>correctly) is about data "locality".
>
>/Amandeep Khurana - "Having a common HDFS cluster and using part of the
>nodes as HBase RS and part as the Hadoop TTs doesn't solve the problem
>of moving data from the HBase RS to the tasks you'll run as a part of
>your MR jobs if HBase is your source/sink. You will still be
>reading/writing over the network."/
>
>When running MR jobs over HBase, data locality is provided by HBase
>(please see
>http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html, and
>also HBase: The Definitive Guide by Lars George, page 298, "MapReduce
>Locality"). In other words, the computation is shipped to where the
>data is, limiting the need to transfer data over the network. Proper
>data locality has a big impact on overall performance.
>
>So I believe that a common HDFS cluster does not imply logical
>segregation between HBase RS and Hadoop TTs. Your point therefore seems
>to contradict Lars George's statement.
>
>Thoughts?
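
For concreteness, here is roughly what that wiring looks like: a minimal
sketch against the 0.92/0.94-era mapreduce API (the table name "mytable"
and the class names are placeholders). TableMapReduceUtil.initTableMapperJob()
pulls in TableInputFormat, which creates one split per region and reports
the hosting RegionServer as the split's preferred location; that is what
lets the JobTracker schedule the map tasks node-locally:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LocalityDemo {

  // Counts rows; output types are NullWritable since we only use a counter.
  static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // TableInputFormat hands each map task exactly one region's rows and
      // reports the hosting RegionServer as the split location, so the
      // JobTracker tries to run this task on that same node.
      ctx.getCounter("demo", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "locality-demo");
    job.setJarByClass(LocalityDemo.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches for MR throughput
    scan.setCacheBlocks(false);  // don't churn the RS block cache from MR

    // Wires in TableInputFormat: one split per region, located at its RS.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, RowCountMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Caveat: node-local scheduling only buys you local reads when the
region's HFile blocks actually sit on the colocated DataNode. Since the
RS writes its first replica locally, a major compaction tends to restore
that locality (this is the point of Lars' post). Writes, on the other
hand, always traverse the HDFS replica pipeline regardless.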
