Good evening, I have built an Rtree on HDFS in order to improve the query performance of high-selectivity spatial queries. The Rtree consists of a number of HDFS files (each one created by a single Reducer, so the number of files equals the number of reducers), where each file contains one subtree of the root of the Rtree. I am investigating how to use the Rtree efficiently, with respect to the locality of each file on HDFS (data placement).
I would like to ask whether it is possible to read a file that resides on HDFS from a plain Java application (not MapReduce). If this is not possible (as I believe), I would either have to download the files to the local filesystem (which is not a solution, since the files could be very large) or run the queries through Hadoop. In the latter case, to maximise the gain, I should probably process a batch of queries in each Job and run each query on a node that is "near" the files involved in answering that specific query. Can I find the node where each file (or at least most of its blocks) is located, and run on that node a reducer that handles these queries? Could the function DFSClient.getBlockLocations() help?

Thank you in advance,
Sofia
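
P.S. To make the first question concrete, this is roughly the kind of standalone (non-MapReduce) access I have in mind, using the org.apache.hadoop.fs.FileSystem API. The namenode URI and the file name are only placeholders for my setup, and I am not sure this is the right way to read the subtree files outside a job:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder namenode URI for my cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            // Hypothetical name of one subtree file written by a Reducer.
            Path subtreeFile = new Path("/rtree/part-r-00000");

            try (FSDataInputStream in = fs.open(subtreeFile)) {
                byte[] buffer = new byte[4096];
                int read = in.read(buffer); // read the beginning of the subtree file
                System.out.println("Read " + read + " bytes from " + subtreeFile);
                // ... deserialize the Rtree nodes and descend the subtree here ...
            }
            fs.close();
        }
    }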
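
For the data-placement question, I also found the public call FileSystem.getFileBlockLocations(), which I assume returns the same information as DFSClient.getBlockLocations(). Something like the following sketch (again with placeholder names) is how I would try to find the hosts that store the blocks of each subtree file, so that the reducer handling the corresponding queries could be scheduled "near" them:

    import java.net.URI;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            // Hypothetical subtree file, as above.
            Path subtreeFile = new Path("/rtree/part-r-00000");
            FileStatus status = fs.getFileStatus(subtreeFile);

            // Ask the namenode which datanodes hold each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }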
