Thanks Amandeep! What I was trying to say is that we want to support both types of workloads: realtime transactional workloads, and batch processing for data analysis. The big question is whether a single HDFS cluster should be shared between the two.
The point you are making (if I am understanding you correctly) is about data "locality":

> Amandeep Khurana wrote:
> Having a common HDFS cluster and using part of the nodes as HBase RS and
> part as the Hadoop TTs doesn't solve the problem of moving data from the
> HBase RS to the tasks you'll run as a part of your MR jobs if HBase is
> your source/sink. You will still be reading/writing over the network.

When running MR jobs over HBase, data locality is provided by HBase (see http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html, and also "HBase: The Definitive Guide" by Lars George, page 298, "MapReduce Locality"). In other words, the computation is shipped to where the data is, which limits the need to transfer data over the network, and proper data locality has a big impact on overall performance. So I believe a common HDFS cluster does not require logical segregation between HBase RS and Hadoop TTs. Your point therefore seems to contradict Lars George's statement. Thoughts?
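For illustration, here is a minimal sketch of an HBase-sourced MR job (the table name "usertable", the counter names, and the class names are placeholders of my own, not from this thread). The locality mechanism is TableInputFormat: it creates one input split per region and reports the hosting region server's hostname as the split location, so the JobTracker can schedule each map task on, or near, the node already serving that region:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class RowCountJob {

    // Each map task receives the rows of exactly one region; the split's
    // declared location is the region server hosting that region, which is
    // what lets the scheduler place the task next to the data.
    static class RowCountMapper
        extends TableMapper<NullWritable, NullWritable> {
      @Override
      protected void map(ImmutableBytesWritable rowKey, Result columns,
          Context context) throws IOException, InterruptedException {
        context.getCounter("Demo", "ROWS").increment(1);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "rowcount");
      job.setJarByClass(RowCountJob.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // fetch rows in batches for scan throughput
      scan.setCacheBlocks(false);  // full scans shouldn't evict the block cache

      // Wires up TableInputFormat, which computes one locality-aware
      // split per region of "usertable".
      TableMapReduceUtil.initTableMapperJob(
          "usertable", scan, RowCountMapper.class,
          NullWritable.class, NullWritable.class, job);

      job.setOutputFormatClass(NullOutputFormat.class);
      job.setNumReduceTasks(0);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Note the locality is best-effort: as Lars' post explains, it holds when the region server has been running long enough for compactions to rewrite the region's HFiles, since the HDFS DataNode co-located with the RS then holds a full replica of those blocks.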
