We've created an implementation of FileSystem that allows us to use Sector (http://sector.sourceforge.net/) as the backing store for Hadoop. This implementation is functionally complete, and we can now run Hadoop MapReduce jobs against data stored in Sector. We're now looking at how to optimize this interface, since performance suffers considerably compared to the same MapReduce jobs run against HDFS. Sector is written in C++, so there is some unavoidable overhead from JNI. One big difference between Sector and HDFS is that Sector is file-based rather than block-based: files are stored intact on the native file system. We suspect this may have something to do with the poor performance, since Hadoop seems to be optimized for a block-based file system.
Based on the assumption that properly supporting data locality will have a large impact on performance, we've implemented getFileBlockLocations(). Since we don't have blocks, our implementation simply creates a single BlockLocation containing the array of nodes hosting the file. The following is what our method looks like:

  public BlockLocation[] getFileBlockLocations( final Path path )
      throws FileNotFoundException, IOException {
    SNode stat = jniClient.sectorStat( path.toString() );
    String[] locs = stat.getLocations();
    if ( locs == null ) {
      return null;
    }
    // Report the whole file as one "block", hosted on every node
    // that holds a copy of the file.
    BlockLocation[] blocs = new BlockLocation[1];
    blocs[0] = new BlockLocation( null, locs, 0L, stat.getSize() );
    return blocs;
  }

In the code above we use the file size, stat.getSize(), as the block length, since each "block" is an entire file, and so the offset is always 0L. This method seems to have improved performance somewhat, but we're wondering if there's a modification we could make that would better help Hadoop locate data. If we found a way to index our files so that they appear as blocks to Hadoop, would that provide a performance benefit? (A rough sketch of what we mean is in the P.S. below.) Any suggestions are appreciated. We're currently testing with Hadoop 0.18.3.

Thanks.

--
Jonathan Seidman
Open Data Group
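
P.S. To make the "index our files so they appear as blocks" idea above a bit more concrete, here is a rough, untested sketch of what we have in mind: carve each file into fixed-size virtual blocks and return one BlockLocation per chunk, with every chunk reporting the same set of hosts (since Sector stores the file intact). SPLIT_SIZE is an arbitrary chunk size we would still have to choose, and SNode/jniClient are our own Sector wrappers, as in the method above.

  // Sketch only: expose an N-byte Sector file as ceil(N / SPLIT_SIZE)
  // virtual blocks, each reporting the full set of hosts for the file.
  private static final long SPLIT_SIZE = 64L * 1024 * 1024; // hypothetical 64 MB "block"

  public BlockLocation[] getFileBlockLocations( final Path path )
      throws FileNotFoundException, IOException {
    SNode stat = jniClient.sectorStat( path.toString() );
    String[] locs = stat.getLocations();
    if ( locs == null ) {
      return null;
    }
    long size = stat.getSize();
    int numBlocks = (int) Math.max( 1, ( size + SPLIT_SIZE - 1 ) / SPLIT_SIZE );
    BlockLocation[] blocs = new BlockLocation[numBlocks];
    for ( int i = 0; i < numBlocks; i++ ) {
      long offset = i * SPLIT_SIZE;
      long length = Math.min( SPLIT_SIZE, size - offset );
      // Every virtual block lives on the same nodes, because Sector
      // keeps the file whole rather than splitting it into blocks.
      blocs[i] = new BlockLocation( null, locs, offset, length );
    }
    return blocs;
  }

We don't know whether reporting finer-grained locations like this would actually help Hadoop schedule map tasks any better, which is essentially what we're asking.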