Hi Andrea,
In your approach you're assuming that:
a) ids are stable, that is, the same feature will always have the same id
b) gathering only ids is significantly faster than retrieving the whole
feature
c) you can easily gather a large number of fids from the datastore
My idea is that the cache should map features by id; it is
basically a hash map.
Another assumption is that the cache should be optimized for BBox queries,
but can't really help with other queries, since there is no way to know
whether the cache already holds all the queried data.
In the case of a BBox query, the cache should be able to answer the query
without the help of the source datastore, by keeping track of query bounds
to know what data it already holds.
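Such a query-bound tracker could look roughly like this (a minimal sketch;
all names are mine, not GeoTools API, and coverage is only detected when a
single earlier query contains the new bounds, whereas a real tracker would
also have to handle unions of several query areas):

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical tracker: remembers the bounds of past BBox queries so the
// cache can tell whether a new query is already fully covered by cached data.
class BBoxQueryTracker {
    private final List<Rectangle2D> knownBounds = new ArrayList<>();

    void register(Rectangle2D queryBounds) {
        knownBounds.add(queryBounds);
    }

    boolean isCovered(Rectangle2D queryBounds) {
        for (Rectangle2D known : knownBounds) {
            if (known.contains(queryBounds)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BBoxQueryTracker tracker = new BBoxQueryTracker();
        tracker.register(new Rectangle2D.Double(0, 0, 100, 100));
        // Inside known bounds: answer from the cache.
        System.out.println(tracker.isCovered(new Rectangle2D.Double(10, 10, 20, 20)));
        // Partially outside: must go back to the source datastore.
        System.out.println(tracker.isCovered(new Rectangle2D.Double(90, 90, 50, 50)));
    }
}
```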
With other queries, it is more difficult to imagine a query tracker, that
is, a component that can examine the query and tell the cache whether it
already has the data or not. What I was thinking of is that the cache could
delegate the query to the source datastore in two steps:
1. ask the source datastore for the ids of the features that should be
returned
2. ask the source datastore for the features the cache doesn't hold yet
Of course, this may result in significant overhead.
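The two steps above could be sketched like this (hypothetical names
throughout; a feature is reduced to a String payload keyed by its fid, and
step 1 is assumed to have already produced the matching ids):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Step 2 of the delegation: fetch from the source only the features
// the cache does not hold yet, then assemble the full answer.
class TwoStepCache {
    private final Map<String, String> byFid = new HashMap<>();
    private final Map<String, String> source; // stands in for the datastore
    int sourceReads = 0; // how many features were fetched from the source

    TwoStepCache(Map<String, String> source) {
        this.source = source;
    }

    List<String> query(Collection<String> matchingFids) {
        List<String> result = new ArrayList<>();
        for (String fid : matchingFids) {
            String feature = byFid.get(fid);
            if (feature == null) {
                // the cache doesn't hold this one yet: go to the source
                feature = source.get(fid);
                byFid.put(fid, feature);
                sourceReads++;
            }
            result.add(feature);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("f1", "feature 1");
        source.put("f2", "feature 2");
        TwoStepCache cache = new TwoStepCache(source);
        cache.query(List.of("f1"));           // f1 read from the source
        cache.query(List.of("f1", "f2"));     // only f2 read from the source
        System.out.println(cache.sourceReads); // 2 reads in total
    }
}
```

The overhead mentioned above shows up as the extra id-only round trip to
the source before any feature is fetched.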
All assumptions are reasonable, but you must be aware that they
may be broken by certain data stores:
a) will be broken by shapefile data stores (when edited, since the id
is simply the row number in the dbf file), by jdbc data stores
mapping a table without a primary key, and by extension, by every
layer served by a remote WFS server such as Geoserver, when
Geoserver is using the above-cited data stores
b) file based data stores may have to gather a good part of the
feature to evaluate the filter, and reading data is where you pay
most of the price.
Good point. I don't really see a workaround, though.
In this case, the cache should flush features when it needs to, and not
remember them once flushed, so it will have to ask the source datastore
again. I was indeed thinking of FIDs as weak references to features, so
that the cache could forget an actual feature but still know that the
feature should be part of the answer to a query.
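That weak-reference idea could be sketched like this (hypothetical names;
the index always keeps the fid, but the feature itself is held through a
java.lang.ref.WeakReference, so the garbage collector may reclaim it):

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// A null result from get() means "I know this fid matters, but the
// feature was forgotten and must be fetched again from the source".
class WeakFeatureCache {
    private final Map<String, WeakReference<Object>> byFid = new HashMap<>();

    void put(String fid, Object feature) {
        byFid.put(fid, new WeakReference<>(feature));
    }

    // Returns the feature, or null when it has been collected.
    Object get(String fid) {
        WeakReference<Object> ref = byFid.get(fid);
        return ref == null ? null : ref.get();
    }

    boolean knows(String fid) {
        return byFid.containsKey(fid);
    }

    public static void main(String[] args) {
        WeakFeatureCache cache = new WeakFeatureCache();
        Object feature = new Object();
        cache.put("f1", feature);
        System.out.println(cache.knows("f1"));          // fid is indexed
        System.out.println(cache.get("f1") == feature); // still strongly held here
    }
}
```

Note that eviction then happens at the garbage collector's whim, one
feature at a time, rather than under the cache's control.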
This may not be a good idea. I suggest instead that the cache make room for
new features by deindexing features, for example by flushing part of an
R-tree, and tell the query tracker about the forgotten spatial area.
b) is a datastore implementation issue. Can it be addressed by a cache
component?
c) jdbc data stores will turn a fid filter with lots of fids into
a "fid in (f1, f2, ..., fn)" query, or an equivalent "fid = f1
or fid = f2 or ... or fid = fn". When fids are many, you may
hit the query size limit of the datastore. This is more of a
limitation in the jdbc data store implementation, but one you
may encounter when starting to play with big amounts of data.
This is indeed a limitation I did not have in mind.
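One possible workaround, sketched here under my own assumptions (the
maximum batch size is arbitrary and depends on the backend), is to split a
large fid set into batches and issue one "fid in (...)" query per batch:

```java
import java.util.ArrayList;
import java.util.List;

// Splits a fid list into sublists of at most maxPerQuery elements,
// so each resulting "fid in (...)" query stays under the size limit.
class FidBatcher {
    static List<List<String>> batches(List<String> fids, int maxPerQuery) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < fids.size(); i += maxPerQuery) {
            result.add(new ArrayList<>(
                fids.subList(i, Math.min(i + maxPerQuery, fids.size()))));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> out =
            batches(List.of("f1", "f2", "f3", "f4", "f5"), 2);
        System.out.println(out.size()); // 3 batches: [f1,f2], [f3,f4], [f5]
    }
}
```

Batching trades the single oversized query for several round trips, so it
mitigates the limit rather than removing the cost.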
Moreover, a hashmap-like data structure will use lots of memory if you
cache each single feature by its id; that's why I tried to split
the cartesian plane into tiles that can be cached as wholes, because
that would have reduced the number of entries in the hashtable.
I will consider this solution. You propose to retrieve and store data by
tiles, don't you?
This will make the first queries more expensive, because you will have to
get more data than you actually need, but it pays off for subsequent queries.
I am not sure I see the storage advantage, though it will reduce the number
of storage accesses.
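The tiling idea could be sketched like this (tile size and key packing are
arbitrary choices of mine): instead of one hashtable entry per feature, the
plane is cut into fixed-size tiles and features are cached per tile, so the
table holds one entry per tile.

```java
// Maps a point to the tile it falls in, so all features of one tile
// can be stored under a single hashtable key.
class TileKey {
    static long key(double x, double y, double tileSize) {
        long col = (long) Math.floor(x / tileSize);
        long row = (long) Math.floor(y / tileSize);
        // pack column and row into a single long (assumes each fits in 32 bits)
        return (col << 32) | (row & 0xffffffffL);
    }

    public static void main(String[] args) {
        // Two points in the same 10x10 tile share a key...
        System.out.println(TileKey.key(12, 25, 10) == TileKey.key(15, 21, 10));
        // ...a point in another tile does not.
        System.out.println(TileKey.key(12, 25, 10) == TileKey.key(32, 25, 10));
    }
}
```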
That said, the solution I proposed is a lot harder, so I'm not suggesting
that you use it; I'm only trying to make you aware of possible issues
you'll encounter during the development of your SoC project. Better
prepared than sorry, no?
I am not sure I understand the whole of your solution, but thank you for
the ideas; they are good material for further thinking.
Cheers,
Christophe
_______________________________________________
Geotools-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/geotools-devel