As promised, I am now going to discuss the caching I have implemented for Geocat, what I would like to do, and the issues that arise from it. We will decide what will actually be done when FX comes back.

First of all, Emmanuel and I realized that search was unacceptably slow when spatial searching was enabled. This did not come as a large surprise to me. I did a quick profile and found that the bottleneck is repeated hard-drive access to obtain the features needed for the spatial operations.

The options that I came up with were the following:

1. Re-use Lucene filters. Filters are designed to be re-used, and subsequent uses of a filter with the same IndexReader are incredibly fast. The problem with this is that for a filter to be re-used ALL the inputs must be the same, at least with the filter implementation we currently have: the same user, the same spatial filter and the same query (perhaps something else I forgot?). There must also be a way to test equality on all of these inputs.

2. Set up the shapefile to be memory-mapped. This might work for Geocat because I know the shapefiles are quite small, but it obviously doesn't scale, and I don't completely trust the Geotools implementation.

3. Use the Geotools caching FeatureStore. This is an unsupported module, so I didn't want to depend on it.

4. Use a standard caching system. This I decided was acceptable because all I need to cache is the FeatureId and the geometry; the spatial index is already handled in-memory, so it is not a factor. All caching systems have different caching strategies and usually allow custom implementations to be used. I chose Apache's JCS 1.3 caching library because it is easy to use and has the following attributes, which I consider important (a minimal sketch follows this list):
   - a maximum pool size can be set
   - multiple caches of different sizes can be set up for different caching requirements
   - a cache-size reducer strategy
   - the ability to spool to disk for larger caches, reducing the amount of memory required to hold what is cached
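
To make option 4 concrete, here is a minimal sketch of what the cache wrapper could look like, assuming JCS 1.3 and the JTS geometry classes on the classpath. The region name and the GeometryCache class are illustrative, not the actual code:

    import org.apache.jcs.JCS;
    import org.apache.jcs.access.exception.CacheException;
    import com.vividsolutions.jts.geom.Geometry;

    // Illustrative wrapper: caches FeatureId -> Geometry so the spatial
    // filter does not have to hit the shapefile for every feature.
    public class GeometryCache {
        private final JCS cache;

        public GeometryCache() throws CacheException {
            // Region sizes, eviction and disk spooling are set in cache.ccf.
            this.cache = JCS.getInstance("geometryCache");
        }

        // Returns the cached geometry, or null on a miss.
        public Geometry get(String featureId) {
            return (Geometry) cache.get(featureId);
        }

        // Stores a geometry read from the shapefile.
        public void put(String featureId, Geometry geometry) throws CacheException {
            cache.put(featureId, geometry);
        }
    }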

So both options 1 and 4 can be used. Option 1 will give a major performance boost on certain queries, but it is hard to program, so we may want to cache search results instead of the filter. The sketch below illustrates the equality requirement from option 1.
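
To show what that equality requirement means in practice, here is a sketch of a composite key covering the inputs listed above. The Key class and the cache map are hypothetical; CachingWrapperFilter is the standard Lucene wrapper that caches a filter's doc-id set per IndexReader:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;

    public class FilterCache {
        // Hypothetical composite key: every input that determines the result.
        static final class Key {
            final String user;
            final String spatialFilter; // e.g. the search geometry as WKT
            final String query;

            Key(String user, String spatialFilter, String query) {
                this.user = user;
                this.spatialFilter = spatialFilter;
                this.query = query;
            }

            public boolean equals(Object o) {
                if (!(o instanceof Key)) return false;
                Key k = (Key) o;
                return user.equals(k.user)
                        && spatialFilter.equals(k.spatialFilter)
                        && query.equals(k.query);
            }

            public int hashCode() {
                return 31 * (31 * user.hashCode() + spatialFilter.hashCode())
                        + query.hashCode();
            }
        }

        private final Map<Key, Filter> cache = new HashMap<Key, Filter>();

        // Re-uses a filter only when ALL inputs match a previous search.
        public synchronized Filter get(Key key, Filter freshFilter) {
            Filter cached = cache.get(key);
            if (cached == null) {
                // Caches the doc-id set per IndexReader, which is what
                // makes subsequent uses with the same reader fast.
                cached = new CachingWrapperFilter(freshFilter);
                cache.put(key, cached);
            }
            return cached;
        }
    }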

So I believe the current solution will work for Geocat, but for it to work in a generic manner with larger datasets we will need another solution. The biggest problem that I currently see is a search that returns a very large result set: if the cache keeps the last x thousand geometries, a big search will cycle every feature through the cache. I will give a small example to make this clearer:

Suppose the cache holds 1000 objects and a search needs 5000 geometries tested for spatial validation. The last thousand will be cached, since they were the most recently used. On the next search the cache will be used and features 4000-5000 will be very quick, but the other features will then refill the cache, so when the search is done the features from 3000-4000 will be cached instead. So we need a better solution for these sorts of searches.
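
A tiny stand-alone simulation (illustrative only, not GeoNetwork code) shows the worst case of this thrashing: if the second search touches the features in the same order as the first, a plain LRU cache never gets a single hit:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruThrashingDemo {
        public static void main(String[] args) {
            final int capacity = 1000;
            // Access-ordered LinkedHashMap used as a plain LRU cache.
            Map<Integer, Object> lru =
                    new LinkedHashMap<Integer, Object>(capacity, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Integer, Object> eldest) {
                    return size() > capacity;
                }
            };
            int hits = 0;
            for (int pass = 0; pass < 2; pass++) {
                for (int feature = 1; feature <= 5000; feature++) {
                    if (lru.get(feature) != null) {
                        hits++;                         // cache hit
                    } else {
                        lru.put(feature, new Object()); // miss: load, evict eldest
                    }
                }
            }
            // Prints 0: every entry is evicted before the next pass needs it.
            System.out.println("hits over two passes: " + hits);
        }
    }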

There are several solutions we can consider:

1. I know the number of tests I need ahead of time, so if it is too large I will not replace the features already in the cache, and will only fill the cache if it is partially empty.

2. Change the caching algorithm so that it is a combination of LRU and most-commonly-used, and perhaps also take feature size into account, although that is probably premature optimization.

3. Recognize common searches and cache the filter for those searches.

4. Since the way JCS caches to disk is quite efficient, we can make the disk cache quite large and let the disk cache be used. The sizes are in a configuration file, so they can be changed for the individual application (an example follows this list).
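
For illustration, a JCS cache.ccf along these lines would set a small memory cache backed by a much larger indexed disk cache. The region name, disk path and sizes are hypothetical examples, not the shipped configuration:

    # Hypothetical region for the FeatureId -> Geometry cache.
    jcs.region.geometryCache=DC
    jcs.region.geometryCache.cacheattributes=org.apache.jcs.engine.CompositeCacheAttributes
    jcs.region.geometryCache.cacheattributes.MaxObjects=1000
    jcs.region.geometryCache.cacheattributes.MemoryCacheName=org.apache.jcs.engine.memory.lru.LRUMemoryCache

    # Indexed disk cache auxiliary: spools overflow to disk.
    jcs.auxiliary.DC=org.apache.jcs.auxiliary.disk.indexed.IndexedDiskCacheFactory
    jcs.auxiliary.DC.attributes=org.apache.jcs.auxiliary.disk.indexed.IndexedDiskCacheAttributes
    jcs.auxiliary.DC.attributes.DiskPath=/tmp/geonetwork-cache
    jcs.auxiliary.DC.attributes.MaxKeySize=100000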

One important point to make is that the cache will not be used to retrieve out-of-date geometries, because the features that are stored have a featureID and an attribute that is the metadata ID. The metadata ID is unique, but the featureID changes each time the metadata is changed. The spatial index is updated on every metadata change and therefore contains only valid featureIDs. Since the spatial index is always used to obtain the featureIDs, only valid features will be obtained from the cache; stale features will eventually be removed from the cache, but they will never be used to return incorrect results. The sketch below shows this lookup order.
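
A minimal sketch of that lookup order, where loadFromShapefile is a hypothetical stand-in for the slow path and GeometryCache is the JCS wrapper sketched earlier:

    import com.vividsolutions.jts.geom.Geometry;

    public abstract class GeometryLookup {
        private final GeometryCache cache; // the JCS wrapper sketched above

        protected GeometryLookup(GeometryCache cache) {
            this.cache = cache;
        }

        // The slow path the cache exists to avoid.
        protected abstract Geometry loadFromShapefile(String featureId)
                throws Exception;

        // featureId always comes from the spatial index, which is rewritten
        // on every metadata change, so a stale cached entry is never asked for.
        public Geometry lookup(String featureId) throws Exception {
            Geometry geometry = cache.get(featureId);
            if (geometry == null) {
                geometry = loadFromShapefile(featureId);
                cache.put(featureId, geometry);
            }
            return geometry;
        }
    }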

Sorry for the essay

Jesse