Hi all,

I want to report on my success with registering and displaying GeoTiffs stored on HDFS. There are some limitations with this approach; in particular, I am unsure whether there's any way to cache or memory-map the data. As such, I believe each request is re-downloading the entire file.

Generally, I hope to document my approach well enough that others could follow it (if needed) and to solicit feedback. In terms of feedback, I'd love to hear 1) if there are improvements, and 2) if the changes are reasonable enough to be considered for a proposal/merge request.

That out of the way, here's the rough outline:

1. Register additional URL handlers.
2. Convince validation layers in GeoServer that 'hdfs' is an ok URL scheme.
3. Get bytes out of the HDFS file.

For step 1, note that Java's URL scheme handling is pluggable via java.net.URLStreamHandler. The docs (1) point out that one can call URL.setURLStreamHandlerFactory to set up a factory to provide such a handler. This method can only be called once, and folks on the internet (2) do yoga to get around that, since Tomcat already registers a factory. They seem to have missed the fact that the Tomcat factory actually lets you add your own. I provide a gist (3) to show a little bean which will instantiate a Hadoop URL handler and try to install it using both of those methods.

For step 2, there are two places I found in GeoServer which validate the URL given in the page for adding a GeoTiff. The first is the GeoServer FileExistValidator, which calls out to a Wicket UrlValidator. Telling the Wicket class to allow_all_schemes knocks out that issue. For the second, in the FileModel, one needs to provide a happy path for URLs which are not local to the filesystem. Those two small changes are here (4).

For step 3, once GeoServer will register a GeoTiff coverage with a non-'file://' URL, we need to read the bytes. Java's Image I/O has a service provider class, javax.imageio.spi.ImageInputStreamSpi, which adapts between instances of a particular class and an ImageInputStream. For my prototype, I wrote an implementation of it which takes a string, checks if it starts with "hdfs", creates a URL, and returns new MemoryCacheImageInputStream(url.openStream()).
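
To make steps 1 and 3 concrete, here is a minimal sketch of the idea, not the code from the gist or the commits. The class names HdfsStringImageInputStreamSpi and StubHdfsHandlerFactory are made up for illustration, and the stub factory serves bytes from memory so the example runs without Hadoop on the classpath; a real deployment would register Hadoop's own handler factory instead. Returning null for non-"hdfs" strings is also an assumption about how one might decline an input.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import javax.imageio.spi.ImageInputStreamSpi;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

/** Hypothetical SPI: turns "hdfs..." strings into an ImageInputStream. */
final class HdfsStringImageInputStreamSpi extends ImageInputStreamSpi {

    HdfsStringImageInputStreamSpi() {
        // Vendor name and version are placeholders; String is the accepted input class.
        super("example", "0.1", String.class);
    }

    @Override
    public String getDescription(Locale locale) {
        return "Reads image data from hdfs URLs given as strings";
    }

    @Override
    public ImageInputStream createInputStreamInstance(Object input, boolean useCache, File cacheDir)
            throws IOException {
        if (!(input instanceof String)) {
            throw new IllegalArgumentException("expected a String input");
        }
        String location = (String) input;
        if (!location.startsWith("hdfs")) {
            return null; // assumption: decline anything that is not an hdfs location
        }
        // Relies on an "hdfs" URLStreamHandler having been registered (step 1).
        URL url = new URL(location);
        return new MemoryCacheImageInputStream(url.openStream());
    }
}

/** Stand-in for a real Hadoop handler factory: serves fixed bytes for any hdfs URL. */
final class StubHdfsHandlerFactory implements URLStreamHandlerFactory {
    private final byte[] content;

    StubHdfsHandlerFactory(byte[] content) {
        this.content = content;
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (!"hdfs".equals(protocol)) {
            return null; // let the JVM fall back to its built-in handlers
        }
        return new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) {
                return new URLConnection(u) {
                    @Override
                    public void connect() {
                        // nothing to connect to in this stub
                    }

                    @Override
                    public InputStream getInputStream() {
                        return new ByteArrayInputStream(content);
                    }
                };
            }
        };
    }
}
```

With something like URL.setURLStreamHandlerFactory(new StubHdfsHandlerFactory(bytes)) called once per JVM, the SPI above can open a string such as "hdfs://namenode/data/sfdem.tif" (a made-up location). Note that MemoryCacheImageInputStream buffers what it reads in memory for the life of that one stream only, which fits with the whole-file re-reads I saw across requests.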

The only problem with this approach is that there is already an implementation which handles Strings, and GeoTools's ImageIOExt tries the first one and skips any others. One can update that handling (5) slightly to try all the handlers. It'd probably be better to update (6) to try url.openStream as a fallback.

During testing, I worked with the sfdem.tif which ships with GeoServer. The HDFS layer was a little slower than the local filesystem layer, but it wasn't unusable. To crank things up, I tried out a 600+ megabyte GeoTiff from Natural Earth, and it was downright slow. Using a network monitor, I was able to observe network traffic consistent with the entire file being re-read for most requests.

I think this approach may be slightly useful for layers which are infrequently accessed, and then only by a few users.

Thanks to everyone who had suggestions and encouragement for the original thread!

Cheers,

Jim

Step 1: Register additional URL handlers:
1. http://download.java.net/jdk7/archive/b123/docs/api/java/net/URL.html#URL%28java.lang.String,%20java.lang.String,%20int,%20java.lang.String%29
2. http://skife.org/java/url/library/2012/05/14/java_url_handlers.html
3. Gist for a bean to register the Hadoop URL handlers: https://gist.github.com/jnh5y/1739baa42466d66e383fa26ffd7235ca

Step 2: GeoServer changes:
4. https://github.com/jnh5y/geoserver/commit/5320f26a0574f034433aa96097054ec1ec782d45
The FileModel change could be a little more robust.

Step 3: GeoTools changes:
5. https://github.com/jnh5y/geotools/commit/f2db29339c7f7e43d0c52ab93195babc1abb6f49
6. Or one could modify the URL handling here:
https://github.com/geosolutions-it/imageio-ext/blob/master/library/streams/src/main/java/it/geosolutions/imageio/stream/input/spi/URLImageInputStreamSpi.java#L88-L97