As promised, I am now going to discuss the caching I have implemented for
Geocat, what I would like to do, and the issues that come out of it. We
will make a decision on what will actually be done when FX comes back.
First of all, Emmanuel and I realized that search was unacceptably slow
when spatial searching was enabled. This did not come as a great surprise
to me. I did a quick profile and found that the bottleneck is repeated
hard-drive access to obtain the features needed for the spatial
operations.
The options I came up with were the following:
Option 1: Re-use Lucene filters. Lucene filters are designed to be
re-used, and subsequent uses of a filter with the same IndexReader are
incredibly fast. The problem is that for a filter to be re-used ALL of its
inputs must be the same, at least with the filter implementation we
currently have: the same user, the same spatial filter and the same query
(perhaps something else I forgot?). And there must be a way to test
equality on all of these inputs.
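To make that concrete, here is a minimal sketch, not actual Geocat code,
of what such re-use could look like. The FilterKey class and its fields
are my assumption about what "all the inputs" would be; CachingWrapperFilter
is the stock Lucene helper that caches its bits per IndexReader:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;

    // Hypothetical composite key: a filter can only be re-used when ALL
    // of these inputs compare equal.
    class FilterKey {
        private final String user;          // assumed: user identifier
        private final String spatialFilter; // assumed: canonical form, e.g. WKT + operation
        private final String query;         // assumed: canonical query string

        FilterKey(String user, String spatialFilter, String query) {
            this.user = user;
            this.spatialFilter = spatialFilter;
            this.query = query;
        }

        public boolean equals(Object o) {
            if (!(o instanceof FilterKey)) return false;
            FilterKey k = (FilterKey) o;
            return user.equals(k.user)
                    && spatialFilter.equals(k.spatialFilter)
                    && query.equals(k.query);
        }

        public int hashCode() {
            return 31 * (31 * user.hashCode() + spatialFilter.hashCode())
                    + query.hashCode();
        }
    }

    class FilterCache {
        private final Map<FilterKey, Filter> filters =
                new ConcurrentHashMap<FilterKey, Filter>();

        // Returns the cached filter when every input matches a previous
        // search; the cached bits stay valid only while the same
        // IndexReader is in use.
        Filter filterFor(FilterKey key, Filter freshFilter) {
            Filter cached = filters.get(key);
            if (cached == null) {
                cached = new CachingWrapperFilter(freshFilter);
                filters.put(key, cached);
            }
            return cached;
        }
    }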
Option 2: Set up the shapefile to be memory-mapped. This might work for
Geocat because I know the shapefiles are quite small, but it obviously
doesn't scale. Also, I don't completely trust the Geotools
implementation.
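For reference, memory mapping is just a connection parameter on the
shapefile datastore; roughly (the file path is made up):

    import java.io.File;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.geotools.data.DataStore;
    import org.geotools.data.DataStoreFinder;

    class MemoryMappedShapefile {
        static DataStore open(File shp) throws IOException {
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("url", shp.toURI().toURL());
            // "memory mapped buffer" is the shapefile plugin's parameter
            // that switches reads to a memory-mapped buffer
            params.put("memory mapped buffer", Boolean.TRUE);
            return DataStoreFinder.getDataStore(params);
        }
    }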
Option 3: Use the Geotools caching FeatureStore. This is an unsupported
module, so I didn't want to depend on it.
Option 4: Use a standard caching system. I decided this was acceptable
because all I need to cache is the FeatureId and the geometry. The spatial
index is already handled in-memory, so it is not a factor. All caching
systems have different caching strategies and usually allow custom
implementations to be used. I chose Apache's JCS 1.3 caching library
because it is easy to use and has the following attributes, which I
consider important (a configuration sketch follows the list):
- a maximum pool size can be set
- multiple caches of different sizes can be set up for different caching
  requirements
- the cache-size reduction strategy is pluggable
- larger caches can spool to disk, reducing the amount of memory required
  to hold what is cached
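To illustrate, this is roughly what the JCS 1.3 wiring could look like;
the region name "geometries", the sizes and the disk path are
placeholders, and the required default-region lines are omitted for
brevity:

    # cache.ccf (excerpt): a bounded LRU memory region that spools to an
    # indexed disk auxiliary
    jcs.region.geometries=DC
    jcs.region.geometries.cacheattributes=org.apache.jcs.engine.CompositeCacheAttributes
    jcs.region.geometries.cacheattributes.MaxObjects=1000
    jcs.region.geometries.cacheattributes.MemoryCacheName=org.apache.jcs.engine.memory.lru.LRUMemoryCache

    jcs.auxiliary.DC=org.apache.jcs.auxiliary.disk.indexed.IndexedDiskCacheFactory
    jcs.auxiliary.DC.attributes=org.apache.jcs.auxiliary.disk.indexed.IndexedDiskCacheAttributes
    jcs.auxiliary.DC.attributes.DiskPath=/tmp/geocat-cache
    jcs.auxiliary.DC.attributes.MaxKeySize=100000

Using the region from Java is then just:

    import org.apache.jcs.JCS;
    import org.apache.jcs.access.exception.CacheException;
    import com.vividsolutions.jts.geom.Geometry;

    class GeometryCache {
        private final JCS cache;

        GeometryCache() throws CacheException {
            // the region name must match the cache.ccf excerpt above
            cache = JCS.getInstance("geometries");
        }

        void put(String featureId, Geometry geom) throws CacheException {
            cache.put(featureId, geom);
        }

        Geometry get(String featureId) {
            return (Geometry) cache.get(featureId); // null on a miss
        }
    }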
So both options 1 and 4 can be used. Option 1 would give a major
performance boost on certain queries, but it is hard to implement, so we
may want to cache search results instead of the filter.
So I believe the current solution will work for Geocat, but in order for
it to work in a generic manner with larger datasets we will need another
solution. The biggest problem I currently see is a search that returns a
very large dataset: if the cache keeps the last x thousand geometries, a
big search will cycle all the features through the cache. A small example
to make this clearer:
Suppose the cache holds 1000 objects and a search needs 5000 geometries
tested for spatial validation. When the search finishes, features
4000-5000 are cached, since they were the most recently used. But when the
next search walks the features in the same order, it starts with features
that are not cached, and loading them evicts features 4000-5000 before the
search ever reaches them. Every search cycles the entire cache, and the
hit rate is effectively zero.
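The cycling is easy to verify with a toy LRU cache (plain JDK, not JCS);
two sequential scans of 5000 ids through a 1000-entry cache never hit:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruThrashDemo {
        public static void main(String[] args) {
            final int capacity = 1000;
            // access-ordered LinkedHashMap that evicts its eldest entry: a toy LRU
            Map<Integer, Boolean> cache =
                    new LinkedHashMap<Integer, Boolean>(capacity, 0.75f, true) {
                        protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                            return size() > capacity;
                        }
                    };
            int hits = 0, misses = 0;
            for (int search = 0; search < 2; search++) {
                for (int fid = 1; fid <= 5000; fid++) {
                    if (cache.get(fid) != null) {
                        hits++;
                    } else {
                        misses++;
                        cache.put(fid, Boolean.TRUE);
                    }
                }
            }
            // prints hits=0 misses=10000: each scan evicts exactly what the next needs
            System.out.println("hits=" + hits + " misses=" + misses);
        }
    }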
So we need a better solution for these sorts of searches.
There are several solutions we can consider:
- I know the number of tests I need ahead of time, so if it is too large I
  will not replace features already in the cache; I will only fill the
  cache if it is partially empty (see the sketch after this list).
- Change the caching algorithm so that it combines LRU with
  most-frequently-used, and perhaps also take feature size into account,
  although that is probably premature optimization.
- Recognize common searches and cache the filter for those searches.
- Since the way JCS caches to disk is quite efficient, we can make the
  disk cache quite large and let it be used. The sizes are in a
  configuration file, so they can be changed per application.
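A sketch of the first idea (the method name and the bare map stand in for
the real JCS wiring): an oversized search may only fill free slots, so it
can never push out warm entries:

    import java.util.Map;
    import com.vividsolutions.jts.geom.Geometry;

    class AdmissionPolicy {
        // Normal searches cache as usual; searches larger than the cache
        // only fill empty space instead of evicting.
        static void maybeCache(Map<String, Geometry> cache, int capacity,
                               int geometriesToTest, String fid, Geometry geom) {
            if (geometriesToTest <= capacity || cache.size() < capacity) {
                cache.put(fid, geom);
            }
        }
    }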
One important point to make is that the cache will not be used to retrieve
out-of-date geometries, because the stored features have a featureID and
an attribute that is the metadata ID. The metadata ID is unique, but the
featureID changes each time the metadata is changed. The spatial index is
updated on every metadata change and therefore only contains valid
featureIDs. Since the spatial index is always used to obtain the
featureIDs, only valid features will be fetched from the cache; stale
features will eventually be evicted and will never be used to return
incorrect results.
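Put together, the lookup path would look roughly like this
(loadFromShapefile is a hypothetical helper); a stale featureID is simply
never requested, so the worst a stale entry can do is occupy space until
it is evicted:

    import org.apache.jcs.JCS;
    import org.apache.jcs.access.exception.CacheException;
    import com.vividsolutions.jts.geom.Geometry;

    class GeometryLookup {
        private final JCS cache;

        GeometryLookup(JCS cache) {
            this.cache = cache;
        }

        // featureId always comes from the spatial index, which is kept in
        // sync with the metadata, so only current ids ever reach the cache
        Geometry lookup(String featureId) throws CacheException {
            Geometry geom = (Geometry) cache.get(featureId);
            if (geom == null) {
                geom = loadFromShapefile(featureId); // hypothetical helper
                cache.put(featureId, geom);
            }
            return geom;
        }

        private Geometry loadFromShapefile(String featureId) {
            // placeholder: would read the feature's geometry from the store
            throw new UnsupportedOperationException("not shown");
        }
    }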
Sorry for the essay
Jesse