As promised, I am now going to discuss the caching I have implemented for
Geocat, what I would like to do, and the issues that come out of it. We
will make a decision on what will actually be done when FX comes back.
First of all, Emmanuel and I realized that search was unacceptably slow
when spatial searching was enabled. This did not come as a great surprise
to me. I did a quick profile and found that the bottleneck is repeated
hard-drive access to obtain the features needed for the spatial
operations.
The options I came up with were the following:
Option 1: Re-use Lucene filters. Lucene filters are designed to be
re-used, and subsequent uses of a filter with the same IndexReader are
incredibly fast. The problem is that for a filter to be re-used ALL of its
inputs must be the same, at least with the filter implementation we
currently have: the same user, the same spatial filter and the same query
(perhaps something else I forgot?). And there must be a way to test
equality on all of these inputs.
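To make that concrete, here is a minimal sketch, not actual Geocat code,
of what such re-use could look like. The FilterKey class and its fields
are my assumption about what "all the inputs" would be; CachingWrapperFilter
is the stock Lucene helper that caches its bits per IndexReader:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;

    // Hypothetical composite key: a filter can only be re-used when ALL
    // of these inputs compare equal.
    class FilterKey {
        private final String user;          // assumed: user identifier
        private final String spatialFilter; // assumed: canonical form, e.g. WKT + operation
        private final String query;         // assumed: canonical query string

        FilterKey(String user, String spatialFilter, String query) {
            this.user = user;
            this.spatialFilter = spatialFilter;
            this.query = query;
        }

        public boolean equals(Object o) {
            if (!(o instanceof FilterKey)) return false;
            FilterKey k = (FilterKey) o;
            return user.equals(k.user)
                    && spatialFilter.equals(k.spatialFilter)
                    && query.equals(k.query);
        }

        public int hashCode() {
            return 31 * (31 * user.hashCode() + spatialFilter.hashCode())
                    + query.hashCode();
        }
    }

    class FilterCache {
        private final Map<FilterKey, Filter> filters =
                new ConcurrentHashMap<FilterKey, Filter>();

        // Returns the cached filter when every input matches a previous
        // search; the cached bits stay valid only while the same
        // IndexReader is in use.
        Filter filterFor(FilterKey key, Filter freshFilter) {
            Filter cached = filters.get(key);
            if (cached == null) {
                cached = new CachingWrapperFilter(freshFilter);
                filters.put(key, cached);
            }
            return cached;
        }
    }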
Option 2: Set up the shapefile to be memory-mapped. This might work for
Geocat because I know the shapefiles are quite small, but it obviously
doesn't scale. Also, I don't completely trust the Geotools
implementation.
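For reference, memory mapping is just a connection parameter on the
shapefile datastore; roughly (the file path is made up):

    import java.io.File;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.geotools.data.DataStore;
    import org.geotools.data.DataStoreFinder;

    class MemoryMappedShapefile {
        static DataStore open(File shp) throws IOException {
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("url", shp.toURI().toURL());
            // "memory mapped buffer" is the shapefile plugin's parameter
            // that switches reads to a memory-mapped buffer
            params.put("memory mapped buffer", Boolean.TRUE);
            return DataStoreFinder.getDataStore(params);
        }
    }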
Option 3: Use the Geotools caching FeatureStore. This is an unsupported
module, so I didn't want to depend on it.
Option 4: Use a standard caching system. I decided this was acceptable
because all I need to cache is the FeatureId and the geometry. The spatial
index is already handled in-memory, so it is not a factor. All caching
systems have different caching strategies and usually allow custom
implementations to be used. I chose Apache's JCS 1.3 caching library
because it is easy to use and has the following attributes, which I
consider important (a configuration sketch follows the list):
- a maximum pool size can be set
- multiple caches of different sizes can be set up for different caching
  requirements
- the cache-size reduction strategy is pluggable
- larger caches can spool to disk, reducing the amount of memory required
  to hold what is cached
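To illustrate, this is roughly what the JCS 1.3 wiring could look like;
the region name "geometries", the sizes and the disk path are
placeholders, and the required default-region lines are omitted for
brevity:

    # cache.ccf (excerpt): a bounded LRU memory region that spools to an
    # indexed disk auxiliary
    jcs.region.geometries=DC
    jcs.region.geometries.cacheattributes=org.apache.jcs.engine.CompositeCacheAttributes
    jcs.region.geometries.cacheattributes.MaxObjects=1000
    jcs.region.geometries.cacheattributes.MemoryCacheName=org.apache.jcs.engine.memory.lru.LRUMemoryCache

    jcs.auxiliary.DC=org.apache.jcs.auxiliary.disk.indexed.IndexedDiskCacheFactory
    jcs.auxiliary.DC.attributes=org.apache.jcs.auxiliary.disk.indexed.IndexedDiskCacheAttributes
    jcs.auxiliary.DC.attributes.DiskPath=/tmp/geocat-cache
    jcs.auxiliary.DC.attributes.MaxKeySize=100000

Using the region from Java is then just:

    import org.apache.jcs.JCS;
    import org.apache.jcs.access.exception.CacheException;
    import com.vividsolutions.jts.geom.Geometry;

    class GeometryCache {
        private final JCS cache;

        GeometryCache() throws CacheException {
            // the region name must match the cache.ccf excerpt above
            cache = JCS.getInstance("geometries");
        }

        void put(String featureId, Geometry geom) throws CacheException {
            cache.put(featureId, geom);
        }

        Geometry get(String featureId) {
            return (Geometry) cache.get(featureId); // null on a miss
        }
    }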
So both options 1 and 4 can be used. Option 1 would give a major
performance boost on certain queries, but it is hard to implement, so we
may want to cache search results instead of the filter.
So I believe the current solution will work for Geocat, but in order for
it to work in a generic manner with larger datasets we will need another
solution. The biggest problem I currently see is a search that returns a
very large dataset: if the cache keeps the last x thousand geometries, a
big search will cycle all the features through the cache. A small example
to make this clearer:
Suppose the cache holds 1000 objects and a search needs 5000 geometries
tested for spatial validation. When the search finishes, features
4000-5000 are cached, since they were the most recently used. But when the
next search walks the features in the same order, it starts with features
that are not cached, and loading them evicts features 4000-5000 before the
search ever reaches them. Every search cycles the entire cache, and the
hit rate is effectively zero.
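The cycling is easy to verify with a toy LRU cache (plain JDK, not JCS);
two sequential scans of 5000 ids through a 1000-entry cache never hit:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruThrashDemo {
        public static void main(String[] args) {
            final int capacity = 1000;
            // access-ordered LinkedHashMap that evicts its eldest entry: a toy LRU
            Map<Integer, Boolean> cache =
                    new LinkedHashMap<Integer, Boolean>(capacity, 0.75f, true) {
                        protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                            return size() > capacity;
                        }
                    };
            int hits = 0, misses = 0;
            for (int search = 0; search < 2; search++) {
                for (int fid = 1; fid <= 5000; fid++) {
                    if (cache.get(fid) != null) {
                        hits++;
                    } else {
                        misses++;
                        cache.put(fid, Boolean.TRUE);
                    }
                }
            }
            // prints hits=0 misses=10000: each scan evicts exactly what the next needs
            System.out.println("hits=" + hits + " misses=" + misses);
        }
    }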
So we need a better solution for these sorts of searches.
There are several solutions we can consider:
- I know the number of tests I need ahead of time, so if it is too large I
  will not replace features already in the cache; I will only fill the
  cache if it is partially empty (see the sketch after this list).
- Change the caching algorithm so that it combines LRU with
  most-frequently-used, and perhaps also take feature size into account,
  although that is probably premature optimization.
- Recognize common searches and cache the filter for those searches.
- Since the way JCS caches to disk is quite efficient, we can make the
  disk cache quite large and let it be used. The sizes are in a
  configuration file, so they can be changed per application.
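A sketch of the first idea (the method name and the bare map stand in for
the real JCS wiring): an oversized search may only fill free slots, so it
can never push out warm entries:

    import java.util.Map;
    import com.vividsolutions.jts.geom.Geometry;

    class AdmissionPolicy {
        // Normal searches cache as usual; searches larger than the cache
        // only fill empty space instead of evicting.
        static void maybeCache(Map<String, Geometry> cache, int capacity,
                               int geometriesToTest, String fid, Geometry geom) {
            if (geometriesToTest <= capacity || cache.size() < capacity) {
                cache.put(fid, geom);
            }
        }
    }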
One important point to make is that the cache will not be used to retrieve
out-of-date geometries, because the stored features have a featureID and
an attribute that is the metadata ID. The metadata ID is unique, but the
featureID changes each time the metadata is changed. The spatial index is
updated on every metadata change and therefore only contains valid
featureIDs. Since the spatial index is always used to obtain the
featureIDs, only valid features will be fetched from the cache; stale
features will eventually be evicted and will never be used to return
incorrect results.
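Put together, the lookup path would look roughly like this
(loadFromShapefile is a hypothetical helper); a stale featureID is simply
never requested, so the worst a stale entry can do is occupy space until
it is evicted:

    import org.apache.jcs.JCS;
    import org.apache.jcs.access.exception.CacheException;
    import com.vividsolutions.jts.geom.Geometry;

    class GeometryLookup {
        private final JCS cache;

        GeometryLookup(JCS cache) {
            this.cache = cache;
        }

        // featureId always comes from the spatial index, which is kept in
        // sync with the metadata, so only current ids ever reach the cache
        Geometry lookup(String featureId) throws CacheException {
            Geometry geom = (Geometry) cache.get(featureId);
            if (geom == null) {
                geom = loadFromShapefile(featureId); // hypothetical helper
                cache.put(featureId, geom);
            }
            return geom;
        }

        private Geometry loadFromShapefile(String featureId) {
            // placeholder: would read the feature's geometry from the store
            throw new UnsupportedOperationException("not shown");
        }
    }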
Sorry for the essay
Jesse