Hi all,

I want to report on my success with registering and displaying GeoTiffs 
stored on HDFS.  There are some limitations with this approach; in 
particular, I am unsure if there's any way to cache / memory-map the 
data.  As such, I believe each request is re-downloading the entire file.

Generally, I hope to document my approach well enough so that others 
could follow it (if needed) and to solicit feedback.  In terms of 
feedback, I'd love to hear 1) if there are improvements, and 2) if the 
changes are reasonable enough to be considered for a proposal/merge request.

With that out of the way, here's the rough outline:

1.  Register additional URL handlers.
2.  Convince validation layers in GeoServer that 'hdfs' is an ok URL scheme.
3.  Get bytes out of the HDFS file.

For step 1, note that Java's URL scheme handling is pluggable via 
java.net.URLStreamHandler.  The docs (1) point out that one can call 
URL.setURLStreamHandlerFactory to set up a factory which provides such a 
handler.  This method can only be called once per JVM, and folks on the 
internet (2) jump through hoops since Tomcat already registers a 
factory.  They seem to have missed the fact that the Tomcat factory 
actually lets you add your own.  I provide a gist (3) to show a little 
bean which will instantiate a Hadoop URL handler and try to install it 
both ways: directly via URL.setURLStreamHandlerFactory, and through 
Tomcat's factory.
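To make the registration step concrete, here is a minimal sketch which 
uses only the JDK.  It is not the code from the gist: DemoHandlerFactory, 
UrlSchemeInstaller and the "demo" scheme are made-up names standing in 
for the Hadoop handler and the 'hdfs' scheme.  It shows the direct route 
and the "already set" failure mode; in a Tomcat deployment one would 
fall back to Tomcat's own factory at the point marked in the comment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.nio.charset.StandardCharsets;

/** Hypothetical factory for a made-up "demo" scheme (a stand-in for 'hdfs'). */
class DemoHandlerFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (!"demo".equals(protocol)) {
            return null; // null lets the JVM fall back to its built-in handlers
        }
        return new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) {
                return new URLConnection(u) {
                    @Override
                    public void connect() {
                        // nothing to connect to in this sketch
                    }

                    @Override
                    public InputStream getInputStream() {
                        // Serve the URL path itself as the "file" contents.
                        return new ByteArrayInputStream(
                                getURL().getPath().getBytes(StandardCharsets.UTF_8));
                    }
                };
            }
        };
    }
}

class UrlSchemeInstaller {
    /** Returns true if installed, false if a factory was already set. */
    static boolean install(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadySet) {
            // The JDK throws java.lang.Error when a factory is already set.
            // Under Tomcat, this is where one would register with
            // Tomcat's factory instead (not shown; needs Tomcat classes).
            return false;
        }
    }

    /** Reads a URL fully as UTF-8 text; wraps IOException for brevity. */
    static String readAll(String spec) {
        try (InputStream in = new URL(spec).openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After UrlSchemeInstaller.install(new DemoHandlerFactory()) succeeds, 
new URL("demo://namenode/sfdem.tif") parses and can be opened like any 
other URL.  A second call to install is refused, which is the 
"can only be called once" limitation described above.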

For step 2, there are two places I found in GeoServer which validate 
the URL given in the page for adding a GeoTiff.  The first is the 
GeoServer FileExistsValidator, which calls out to a Wicket UrlValidator.  
Constructing the Wicket class with its ALLOW_ALL_SCHEMES option knocks 
out that issue.  For the second, in the FileModel, one needs to provide 
a happy path for URLs which are not local to the filesystem.  Those two 
small changes are here (4).
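The "happy path" for non-local URLs boils down to deciding whether a 
location string names a local file or something with another scheme.  
The sketch below is not the FileModel change from (4); 
LocationClassifier is a hypothetical helper, and the single-letter 
scheme rule for Windows drive letters is my own assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: is this location a local file, or a remote URL? */
class LocationClassifier {
    static boolean isLocalFile(String location) {
        try {
            String scheme = new URI(location).getScheme();
            if (scheme == null) {
                return true; // plain relative or absolute path
            }
            if (scheme.length() == 1) {
                return true; // assumption: "C:/..." is a Windows drive letter
            }
            return "file".equalsIgnoreCase(scheme);
        } catch (URISyntaxException e) {
            // Not a valid URI (e.g. backslashes); treat it as a local path.
            return true;
        }
    }
}
```

With this, "file:data/sfdem.tif" and "data/sfdem.tif" would go through 
the existing local-file handling, while 
"hdfs://namenode:8020/data/sfdem.tif" would be passed along untouched.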

For step 3, once GeoServer will register a GeoTiff coverage with a 
non-'file://' URL, we need to read the bytes.  Java's ImageIO has a 
service provider class, javax.imageio.spi.ImageInputStreamSpi, which 
adapts instances of a particular input class to an ImageInputStream.

For my prototype, I wrote an implementation of this SPI which takes a 
string, checks if it starts with "hdfs", creates a URL, and returns new 
MemoryCacheImageInputStream(url.openStream()).  The only problem with 
this approach is that there is already an implementation which handles 
Strings, and GeoTools's ImageIOExt tries the first one and skips any 
others.  One can update that handling (5) slightly to try all the 
handlers.  It'd probably be better to update (6) to try url.openStream 
as a fallback.
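A rough sketch of both ideas, using only JDK classes.  This is not the 
code from (5) or (6): HdfsStringSpi and StreamLookup are hypothetical 
names, the vendor and version strings are placeholders, and returning 
null for strings that are not "hdfs" URLs is my assumption about how to 
let other providers have a turn.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import javax.imageio.spi.IIORegistry;
import javax.imageio.spi.ImageInputStreamSpi;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

/** Hypothetical SPI: turns "hdfs..." strings into an ImageInputStream. */
class HdfsStringSpi extends ImageInputStreamSpi {
    HdfsStringSpi() {
        super("example-vendor", "0.1", String.class); // placeholder metadata
    }

    @Override
    public ImageInputStream createInputStreamInstance(
            Object input, boolean useCache, File cacheDir) throws IOException {
        if (!(input instanceof String) || !((String) input).startsWith("hdfs")) {
            return null; // assumption: null means "not mine, try another SPI"
        }
        // Requires an 'hdfs' URL handler to be registered (step 1).
        URL url = new URL((String) input);
        return new MemoryCacheImageInputStream(url.openStream());
    }

    @Override
    public String getDescription(Locale locale) {
        return "Reads images from hdfs URLs given as Strings (sketch)";
    }
}

class StreamLookup {
    /** Tries every candidate SPI instead of stopping at the first match. */
    static ImageInputStream open(Object input, ImageInputStreamSpi... spis) {
        for (ImageInputStreamSpi spi : spis) {
            if (!spi.getInputClass().isInstance(input)) {
                continue; // this provider handles a different input type
            }
            try {
                ImageInputStream stream =
                        spi.createInputStreamInstance(input, false, null);
                if (stream != null) {
                    return stream;
                }
            } catch (IOException | RuntimeException e) {
                // This provider could not open the input; try the next one.
            }
        }
        return null;
    }

    /** Same lookup, over everything registered with ImageIO. */
    static ImageInputStream openWithRegistered(Object input) {
        List<ImageInputStreamSpi> all = new ArrayList<>();
        Iterator<ImageInputStreamSpi> it = IIORegistry.getDefaultInstance()
                .getServiceProviders(ImageInputStreamSpi.class, true);
        while (it.hasNext()) {
            all.add(it.next());
        }
        return open(input, all.toArray(new ImageInputStreamSpi[0]));
    }
}
```

To use openWithRegistered, the provider first has to be registered, for 
example with 
IIORegistry.getDefaultInstance().registerServiceProvider(new HdfsStringSpi()).  
The loop only moves on when a provider returns null or throws, so the 
order of providers still decides which one wins when several can open 
the same input.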

During testing, I worked with the sfdem.tif which ships with GeoServer.  
The hdfs layer was a little slower than the local filesystem layer, but 
it wasn't unusable.  To crank things up, I tried out a 600+ megabyte 
GeoTiff from Natural Earth, and it was downright slow.  Using a network 
monitor, I was able to observe network traffic consistent with the 
entire file being re-read for most requests.  I think this approach may 
be slightly useful for layers which are infrequently accessed and then 
only by a few users.
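One plausible contributor to the re-reading, offered as an illustration 
rather than a measurement of GeoServer: MemoryCacheImageInputStream sits 
on top of a forward-only InputStream, so seeking to an offset means 
pulling every byte before it from the source.  The sketch below counts 
bytes pulled from an in-memory source; CountingInputStream and SeekCost 
are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

/** Counts how many bytes are actually read from the wrapped stream. */
class CountingInputStream extends FilterInputStream {
    long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }
}

class SeekCost {
    /** Bytes pulled from the source in order to read only the last byte. */
    static long bytesPulledToReadLastByte(int size) {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        try (ImageInputStream iis = new MemoryCacheImageInputStream(source)) {
            iis.seek(size - 1); // jump to the final byte
            iis.read();         // forces everything before it to be loaded
            return source.bytesRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If each request opens a fresh stream, that in-memory cache starts empty 
every time, which would be consistent with the traffic I observed.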

Thanks to everyone who had suggestions and encouragement for the 
original thread!

Cheers,

Jim

Step 1: Register additional URL handlers:

1. http://download.java.net/jdk7/archive/b123/docs/api/java/net/URL.html#URL%28java.lang.String,%20java.lang.String,%20int,%20java.lang.String%29

2. http://skife.org/java/url/library/2012/05/14/java_url_handlers.html

3. Gist for a bean to register the Hadoop URL handlers:
https://gist.github.com/jnh5y/1739baa42466d66e383fa26ffd7235ca

Step 2: GeoServer changes:

4. https://github.com/jnh5y/geoserver/commit/5320f26a0574f034433aa96097054ec1ec782d45
The FileModel change could be a little more robust.

Step 3: GeoTools changes:

5. https://github.com/jnh5y/geotools/commit/f2db29339c7f7e43d0c52ab93195babc1abb6f49

6. Or one could modify the URL handling here:
https://github.com/geosolutions-it/imageio-ext/blob/master/library/streams/src/main/java/it/geosolutions/imageio/stream/input/spi/URLImageInputStreamSpi.java#L88-L97




_______________________________________________
GeoTools-Devel mailing list
GeoTools-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geotools-devel
