Re: [Geoserver-devel] GSIP 69 - Catalog scalability enhancements - OGC Filters VS predicate

Gabriel Roldan Thu, 03 May 2012 10:17:05 -0700

Hi Andrea, all,

First off, sorry it took me so long to reply to this important thread,
and thanks again, Andrea, for splitting the discussion into separate
topics.

On Sat, Apr 28, 2012 at 10:47 AM, Andrea Aime
<[email protected]> wrote:
> Hum... the thread is getting long and mails deal with many topics, let me
> try to
> split this into separate sub-threads.
> This one is about filters, paging and sorting.
>
> About sorting I believe we are all on the same page, my suggestions
> about checking for fast sorting was just a random idea anyways,
> don't see any strong need to see it implemented.
Yep, I already decided to get rid of Catalog.canSort() based on feedback.

You make some good points. Let's go to the most important one first,
Predicate vs OGC Filter:

To start with, I am willing to concede on using OGC Filter instead
of the "domain specific" Predicate interface. I have a couple of
usability concerns though; more about this later.

I do know it is a general purpose filter predicate language, coming
from GeoTools, which we already depend on. When I mention it as a
dependency I mean at the architectural level, not at the classpath
level. I also know from experience that our Filter implementation
works on non-feature object models, given appropriate property
extractors. If we go that route, we'll probably need a Catalog domain
specific (i.e., *Info) property extractor.
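
A minimal sketch of what such an extractor could look like, assuming
the *Info objects expose plain JavaBean getters. BeanPropertyExtractor,
DemoLayer and DemoResource are made-up names for illustration only, not
existing GeoTools or GeoServer classes:

```java
import java.lang.reflect.Method;

/** Hypothetical bean property extractor; not a GeoTools or GeoServer class. */
final class BeanPropertyExtractor {

    /** Resolves a possibly dotted path such as "resource.name" one getter at a time. */
    static Object get(Object bean, String path) {
        Object current = bean;
        for (String property : path.split("\\.")) {
            if (current == null) {
                return null;
            }
            current = readOne(current, property);
        }
        return current;
    }

    /** Calls getX() or isX() for property "x"; returns null when neither exists. */
    private static Object readOne(Object bean, String property) {
        String suffix = Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (String prefix : new String[] {"get", "is"}) {
            try {
                Method getter = bean.getClass().getMethod(prefix + suffix);
                return getter.invoke(bean);
            } catch (NoSuchMethodException e) {
                // fall through and try the next prefix
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + property, e);
            }
        }
        return null;
    }

    /** Stand-in for a resource *Info object, only here to make the sketch runnable. */
    static final class DemoResource {
        private final String name;
        DemoResource(String name) { this.name = name; }
        public String getName() { return name; }
    }

    /** Stand-in for a layer *Info object. */
    static final class DemoLayer {
        private final DemoResource resource;
        private final boolean enabled;
        DemoLayer(DemoResource resource, boolean enabled) {
            this.resource = resource;
            this.enabled = enabled;
        }
        public DemoResource getResource() { return resource; }
        public boolean isEnabled() { return enabled; }
    }
}
```

A real extractor would also have to deal with multivalued properties
and the metadata map, which this sketch deliberately leaves out.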

I do understand OGC Filter is richer and more expressive than the
(purposely) limited number of "well known" Predicate idioms. More
about the rationale later.

Your argument about the ease of encoding Filter, because we already
have code to do that, applies, AFAIK, to a specific data model (flat
table), at least in practice, unless you want to use app-schema. It
is not a bad argument, it just falls on the implementation detail side
of the fence. I assume in good faith that you value the decoupling of
the interface from the underlying data model, so nothing to add here.
About the availability of filter splitters, that's indeed a very nice
bonus point for the OGC Filter choice.

The argument about the lack of spatial filtering in Predicate is
debatable. The reason there are no well known spatial predicates in
the proposal is that the set of proposed well known constructs is
strictly limited to fit the current use of Catalog. That is, equals,
"iLike", and, or, exists, isNull. Nothing prevents adding spatial
constructs once the real need for them arises, but if you want them
executed by the storage engine, then that also means imposing a new
requirement on the catalog backend, to support spatial queries, which
currently is not a requirement. If, on the other hand, you want
spatial filters without requiring the backend to support spatial, then
it's just as easy to create a predicate as an anonymous inner class.
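
As a rough illustration of that last point, here is a sketch of a
bounding box test written as an anonymous inner class. The Predicate
interface below is a simplified stand-in for the one in the proposal,
and Layer, intersects() and namesMatching() are hypothetical names
introduced only for this example:

```java
import java.util.ArrayList;
import java.util.List;

final class SpatialPredicateSketch {

    /** Simplified stand-in for the proposal's Predicate interface (assumption). */
    interface Predicate<T> {
        boolean apply(T input);
    }

    /** Hypothetical layer carrying only a name and its bounds. */
    static final class Layer {
        final String name;
        final double minX, minY, maxX, maxY;
        Layer(String name, double minX, double minY, double maxX, double maxY) {
            this.name = name;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    /** Builds an in-process predicate that accepts layers whose bounds overlap the given box. */
    static Predicate<Layer> intersects(final double minX, final double minY,
                                       final double maxX, final double maxY) {
        return new Predicate<Layer>() {
            @Override
            public boolean apply(Layer layer) {
                return layer.minX <= maxX && layer.maxX >= minX
                        && layer.minY <= maxY && layer.maxY >= minY;
            }
        };
    }

    /** Evaluates the predicate over a list and returns the names of the matches. */
    static List<String> namesMatching(List<Layer> layers, Predicate<Layer> filter) {
        List<String> names = new ArrayList<String>();
        for (Layer layer : layers) {
            if (filter.apply(layer)) {
                names.add(layer.name);
            }
        }
        return names;
    }
}
```

Since the test runs in-process, the backend needs no spatial support
for it; the price is that every candidate object has to be visited.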

In order to make this as short as possible: a possible CSW
implementation on top of the GeoServer Catalog, ability to specify a
CQL filter as a GetCapabilities parameter, ability to wrap the Catalog
as a DataStore and hence draw maps of where the layers are and expose
that information through WFS, are all really nice ideas. I want to
make clear I'm not against any of those, and actually encourage that
kind of feedback whenever a new proposal is made, since that's how we
as a community ensure our product serves everyone's needs. On the
other side, the best way we know so far to make things happen, is to
tie ourselves to an iterative and incremental development process. The
focus of GSIP69 is to solve today's GeoServer scalability problems.
Nonetheless, this kind of feedback is valuable in terms of planning
ahead for extensibility. So thanks for bringing up those points. That
said, I don't really see any of those new feature ideas as blockers
for having domain specific query constructs for the Catalog. I
think it's pointless to go too deep into each of them right now,
but I acknowledge that the current infrastructure for working with OGC
Filters would be convenient: translating CQL to Predicate, although
not hard at all, does look like unnecessary duplication.

So, down to the core of this discussion: what we're essentially
discussing is a design decision. I am sorry it seemed like the
one I made was lightly taken. It was not, which doesn't mean it was
the most accurate either. I seriously considered our already available
OGC Filter implementation as the first choice, as it seems to fit
naturally, or rather easily.
Please note though, the proposal is in "under discussion" status, so
that's exactly what should be happening. Hence I'm glad we're doing
so, and I'd like to see more of this kind of discussion, as I often
find myself looking at bad smelling pieces of code throughout our
codebase. Including, of course, my own code.

All that said, I think it's time to weigh both options so we can
make a decision knowing what the benefits and drawbacks are for each
one.

The following is _my_ current thinking on both approaches. Feel free
to complete/correct with your own. I know everything is debatable,
I'm just trying to figure out a sensible set of pros and cons so that
we can make an informed decision.

Option #1, use OGC Filter as the Catalog query model
----------------------------------------------------
Rationale: convenience re existing infrastructure, familiarity, high
expressiveness, ability to create complex filters.

Benefits
========

- Code reuse: GeoTools' OGC Filter implementation is in wide use by
the Data Access APIs, meaning there are a lot of utilities to deal
with them already, like filter splitters.

- Familiarity: most developers that work with the GeoServer Catalog,
are probably already familiar with the GeoTools Filter API.

- Wide range of ready to use filter constructs: almost everything you
can translate to a SQL where clause is in there.

Caveats
=======

- Difficult extensibility. The way to extend filter functionality is by
creating custom Functions. There are cases where the required filter
can't be expressed using the prescribed OGC Filter idioms.

     FilterFactory ff = CommonFactoryFinder.getFilterFactory(null);
     Filter enabledLayers = ff.equals(ff.property("enabled"),
             ff.literal(Boolean.TRUE));
     // "???" is the part OGC Filter cannot express without a custom Function
     Filter brokenLayers = ff.and(enabledLayers, ???);
Here we want brokenLayers to be a filter that returns all layers that
are enabled by configuration, but broken or disabled by cascading,
using the derived enabled() property. This is not possible without
registering a function factory with a function specific for that
purpose. But doing so would make it available globally, whereas it's
domain specific. If, on the other hand, you simply get the layers
enabled by configuration (i.e. LayerInfo.isEnabled()), and then iterate
over them in client code and check every one for the derived
enabled() property, you lose the ability to execute the predicate
further down the call chain, unnecessarily making all the Catalog
wrappers create wrapper objects for them.

     Any non natively encodable predicate suffers from this issue.
Another example is the security filtering in SecureCatalogImpl. This
filtering is based on externally configured constraints that may or may
not be translatable to a "natively encodable" query predicate. Moreover,
the logic may change over time. The proposed approach builds a custom
predicate that is evaluated in-process, with the important
characteristic of being pushed down to be evaluated before results are
returned to the calling code. So even if it's not "natively encodable",
it still has the intended effect of being processed at the bottom of
the call chain, sparing every catalog wrapper (including
SecureCatalogImpl) from creating object wrappers for results that are
then discarded, hence lowering the memory consumption and GC
overhead.

- Deviation from simple property filtering: some filters are not so
straightforward. For example, querying for simple properties of
multivalued properties would require some sort of XPath syntax. How
well frameworks like JXPath fit into our object model is yet to be
assessed. We have static and dynamic properties (through MetadataMap).
Using a custom property extractor that changes the regular OGC Filter
property addressing, to fit our desire for simplicity, would lead to
confusion, whilst with our own query model we can make it
straightforward. A filter like "styles.id = 'someid'" in the proposed query
model is as simple as that, an OGC Filter would ma

- Pandora's box: There's so much functionality in OGC Filter that is
not proved against the Catalog object model, that ensuring proper
functioning of each and every possible filter construct may require a
significant effort, opening the door for random bug reports over
untested code paths.

- Implementation constraints: it is considered a desirable feature for
the API to impose as few constraints on implementations as
possible. The usage of Catalog, and hence its query needs, are so far
rather simple: find by id, find by name, and little more. Using OGC
Filter means either requiring the backend to handle a lot
of filter constructs that might never see real use, or rather
executing them mostly in-process, that the
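
The wrapper argument made under "Difficult extensibility" above can be
illustrated with a small sketch. It compares how many decorator objects
get created when a predicate is applied after wrapping versus when it
is pushed down below the decorator. All names here (PushDownSketch,
Layer, SecuredLayer and the two methods) are hypothetical and are not
the actual SecureCatalogImpl code:

```java
import java.util.ArrayList;
import java.util.List;

final class PushDownSketch {

    /** Simplified stand-in for the proposal's Predicate interface (assumption). */
    interface Predicate<T> {
        boolean apply(T input);
    }

    /** Hypothetical stored catalog object. */
    static final class Layer {
        final String name;
        final boolean visible;
        Layer(String name, boolean visible) {
            this.name = name;
            this.visible = visible;
        }
    }

    /** Hypothetical decorator object, as a security wrapper might create. */
    static final class SecuredLayer {
        final Layer delegate;
        SecuredLayer(Layer delegate) {
            this.delegate = delegate;
        }
    }

    /** Wraps every stored object first and filters afterwards; returns wrappers created. */
    static int wrappersWhenFilteringLast(List<Layer> stored, Predicate<Layer> filter) {
        List<SecuredLayer> wrapped = new ArrayList<SecuredLayer>();
        for (Layer layer : stored) {
            wrapped.add(new SecuredLayer(layer));
        }
        List<SecuredLayer> result = new ArrayList<SecuredLayer>();
        for (SecuredLayer secured : wrapped) {
            if (filter.apply(secured.delegate)) {
                result.add(secured);
            }
        }
        return wrapped.size(); // one wrapper per stored object, matching or not
    }

    /** Evaluates the predicate at the bottom and wraps only matches; returns wrappers created. */
    static int wrappersWhenPushedDown(List<Layer> stored, Predicate<Layer> filter) {
        List<SecuredLayer> result = new ArrayList<SecuredLayer>();
        for (Layer layer : stored) {
            if (filter.apply(layer)) {
                result.add(new SecuredLayer(layer));
            }
        }
        return result.size(); // one wrapper per match only
    }
}
```

With a large catalog and a selective predicate, the second variant
creates only as many wrappers as there are results, which is the
memory and GC saving described above.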

Option #2, define a GeoServer Catalog's own query model
-------------------------------------------------------
Rationale: architectural consistency, cohesion, ease of use, extensibility.

Benefits
========
- Easy extensibility: no need to register extra factories, just use
simple anonymous inner classes to create ad-hoc predicates. Example:
   Predicate<LayerInfo> enabledLayers = propertyEquals("enabled", true);
   Predicate<LayerInfo> brokenLayers = and(enabledLayers,
           new Predicate<LayerInfo>() {
               @Override
               public boolean apply(LayerInfo layer) {
                   // note the use of the derived enabled() property
                   // instead of the POJO isEnabled() one
                   return !layer.enabled();
               }
           });

- Ease of use: avoid casting everywhere. We're working with
CatalogInfo and derivatives, so let's make use of modern language
constructs.

- Scope and feature creep containment[1]: by limiting the number of
well known predicates to the indispensable minimum we stay in control
of what can be done and how. We thereby impose as few
implementation constraints as possible, and keep the focus on what the
Catalog is for. Nothing prevents new features from being built that use
or are based on the Catalog objects. But adding too many features and
increasing scope just in case can lead to unnecessary complexity and
hurt maintainability. An example of this is the recent move of
GeoWebCache's tile layer configuration out of the Catalog objects'
metadata map: initially it seemed convenient to (ab)use the LayerInfo
and LayerGroupInfo metadata maps to hold the related tiled layer
configuration. As the complexity of the gwc integration configuration
grew, the approach started to show its drawbacks: catalog object
configuration flooding, lack of quick ways to get only the layers that
do have an associated tile layer, and more. Moving away from that
model and having the integrated GWC maintain its own set of
configuration objects, although still depending on the available
GeoServer catalog layers, eliminated the added complexity in catalog
configuration and avoided having to extend or modify it just to
serve an orthogonal concern.

- Fewer implementation constraints: it was and still is a driving
principle to impose as few implementation constraints on catalog
backends as possible, given that the proposal targets catalog
scalability and there's more than one way of skinning a cat.
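
To make the "easy extensibility" benefit above more concrete, here is
a self-contained sketch of how propertyEquals() and and() could be
implemented and combined with an ad-hoc predicate. Everything in it is
an assumption made for illustration: the Predicate interface is a
simplified stand-in, the combinators use naive reflection, and
DemoLayer only mimics the isEnabled()/enabled() pair of LayerInfo:

```java
import java.lang.reflect.Method;
import java.util.Objects;

final class CatalogPredicatesSketch {

    /** Simplified stand-in for the proposal's Predicate interface (assumption). */
    interface Predicate<T> {
        boolean apply(T input);
    }

    /** Well known predicate: compares a bean property (getX()/isX()) with a value. */
    static <T> Predicate<T> propertyEquals(final String property, final Object expected) {
        return new Predicate<T>() {
            @Override
            public boolean apply(T input) {
                return Objects.equals(expected, read(input, property));
            }
        };
    }

    /** Well known predicate: logical conjunction of two predicates. */
    static <T> Predicate<T> and(final Predicate<T> left, final Predicate<T> right) {
        return new Predicate<T>() {
            @Override
            public boolean apply(T input) {
                return left.apply(input) && right.apply(input);
            }
        };
    }

    /** Naive reflective getter lookup; returns null when no getter is found. */
    private static Object read(Object bean, String property) {
        String suffix = Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (String prefix : new String[] {"get", "is"}) {
            try {
                Method getter = bean.getClass().getMethod(prefix + suffix);
                return getter.invoke(bean);
            } catch (NoSuchMethodException e) {
                // try the next prefix
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + property, e);
            }
        }
        return null;
    }

    /** Hypothetical layer: isEnabled() is the configured flag, enabled() the derived one. */
    static final class DemoLayer {
        private final boolean configuredEnabled;
        private final boolean resourceEnabled;
        DemoLayer(boolean configuredEnabled, boolean resourceEnabled) {
            this.configuredEnabled = configuredEnabled;
            this.resourceEnabled = resourceEnabled;
        }
        public boolean isEnabled() { return configuredEnabled; }
        public boolean enabled() { return configuredEnabled && resourceEnabled; }
    }

    /** Enabled by configuration, yet not effectively enabled. */
    static Predicate<DemoLayer> brokenLayers() {
        Predicate<DemoLayer> enabledByConfig = propertyEquals("enabled", true);
        return and(enabledByConfig, new Predicate<DemoLayer>() {
            @Override
            public boolean apply(DemoLayer layer) {
                return !layer.enabled(); // derived property, not the POJO one
            }
        });
    }
}
```

A backend that recognizes the well known half (propertyEquals) could
encode it natively and evaluate only the anonymous half in-process.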

[1]
http://en.wikipedia.org/wiki/Scope_creep
http://en.wikipedia.org/wiki/Feature_creep

Caveats
=======

- Yet another query predicate: although it's meant to be really
straightforward (there's not much to debate about the meaning of
equals, isNull, contains (a.k.a. iLike), and, and or), we need to
recognize it is yet another query predicate "language".

- Limited set of well known predicate idioms: although the proponent
thinks of this as a feature and not a bug, an argument can be made the
other way around, depending on which characteristics you value the
most, and where you want to draw the application boundaries
architecture wise.

------------------------------

All this said, I reiterate this is a design decision I'm willing to
concede on and switch to OGC Filter. The only real blocker IMHO is the
inability to easily extend in place: we would either have to use
Catalog specific function factories, flooding the general purpose
filters with catalog specific ones, or give up execution of
predicates on the backend and be forced to iterate (over a lot of
objects) in place, apply the in-process filtering in client code,
and be exposed to unnecessary wrapping from catalog decorators.

Some other random thoughts in line.

>
> About the topic of the predicate API being simpler than the OGC one... yes,
> it has less filters, which actually makes it less useful.
My position is it makes it as useful as needed.
> About it being harder to build filters, I don't see it.
> The filter building styles of both goes through a factory with short
> named methods and arguably OGC allows for CQL expressions to
> be used if that is perceived to be simpler.

Hmm... at first glance, yes, it looks simpler. And indeed CQL is a
nice terse way of creating a Filter in test cases.
When it comes to actual application code, where the inputs are
dynamically obtained from the user, it's not so. You'd need to deal
with string concatenation and proper parameter escaping to make sure
the resulting CQL is well formed.
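
To illustrate the kind of care that takes, here is a tiny sketch. It
assumes CQL string literals follow the SQL-style convention of doubling
embedded single quotes, which should be checked against the parser
actually in use. CqlLiteralSketch, quote() and nameEquals() are
hypothetical helpers, not part of any existing API:

```java
final class CqlLiteralSketch {

    /**
     * Wraps a user-supplied value in single quotes, doubling any embedded
     * single quote (assumed escaping convention, see the note above).
     */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds a "name = '...'" expression from raw user input. */
    static String nameEquals(String userInput) {
        return "name = " + quote(userInput);
    }
}
```

Without the quoting step, an input such as O'Brien would end the
literal early and leave the rest of the expression malformed.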

>
> I don't see many people implementing lots of catalog subsystem
> implementations,
> and those people will likely deal with GeoServer in other ways so they will
> have a passing familiarity with OGC filter concepts already.
> Having to learn another API actually makes things more confusing, you
> need to remember what each API does and how.

I see your point, but think it has little applicability. We're not
changing the meaning of equals, isNull, exists, contains (we can rename
it iLike if that looks more familiar), and, and or.

>
> Encoding wise, we already have a lot of code that allows to split filter and
> encode
> them in SQL and other languages, which means we have examples on
> how things are done. Filter splitters do not make any use of the feature
> type, they only know about the filter types listed in a filter capabilities
> object
> (and they can be subclassed to allow more targeted checks),
> and filter encoders are something you need to prime often with a feature
> type,
> while in this case you'll have to prime them with the bean class.
> I don't see the difference nor the difficulty.
>
> Lack of spatial filter is problematic because of possible
> CSW implementations, but also for security subsystems that often
> express spatial constraints on the data (and layers) you can actually
> see, not having an efficient way to make them run looks like a serious
> drawback to me.
>
> But also think about the case in which you are not doing multitenancy,
> but you do have tons of layers. I know of one installation in Italy at
> a research center that had, one year ago, 160k layers registered,
> with new ones showing up every day.
> Think how useful it would be for a case like this to be able to pass down
> a CQL filter on the GetCapabilities to get a more focused capabilities
> document
> for WMS/WFS/WCS usage.
>
> About the GUI filter that is not implementable efficiently in Predicate,
> I did not notice "contains" is a well known filter, so sorry about that one,
> reviewing the whole work in just half a day means I could not actually read
> everything line by line (and often I had difficulties understanding what the
> code did, see also the other mail about catalog implementations, but
> generally speaking the proposal was rich in terms of describing api and
> architectural concepts and poor in terms of describing how things are done,
> which is equally important for something that aims at being committed).

Good advice. I'll try to point out how things are implemented in the
proposal wherever it's appropriate. A point can be made that the
community module is not strictly part of the proposal, but an API
validation aid, as we're not proposing to replace the default catalog
with it. But in any case I see what you mean and it seems valuable
feedback to me.

Cheers,
Gabriel.
>
> About the relationship between catalog and data stores I don't want in
> any way impose it, nor I want to have abstraction layers be broken,
> I simply recognize OGC filter as a generic data access filtering API
> while it seems that you see it as something specific to GeoTools... but
> it's not, it's not meant to be, using property extractors you can actually
> have it filter whatever, features, beans, hashmaps, spatial stuff and
> non spatial one.
>
> The thing could then go two ways. An easy nice to have is to
> be able to build a store on top of the catalog that would allow
> to display WMS maps of where the layers are, and search over
> the catalog via simple WFS (which can be a nice way to allow
> someone that wants to search into the server without having to
> build or use a CSW client). In both cases having spatial filters would
> be really handy, but in general the richness of OGC filter would
> allow for complex searches to be made fast.
>
> The other direction is to be able to build a catalog the other way, that is,
> build it on top of a data store. Now, I'm not sure it would be great, and
> we may not want to use it, but it would likely make for a quick to
> implement spatially searchable catalog.
> (mind, this is not the strong argument, I'm actually just thinking out loud,
> the strong argument is that the filter API is good, tested, well known,
> rich and flexible, Predicate is none of that, the rest of the arguments
> are just topping on the cake).
> Feature types could be created by flattening the objects and creating
> the feature type by reflection. Of course this would break the moment
> new attributes show up, but we can call updateSchema() to add those, right?
> It would also make for a more "relational" setup on DBMS storage, which
> many people would feel more comfortable with, and would make it easier
> for other applications to directly edit the persisted catalog (something I'm
> sure many people will want to do).
>
> Cheers
> Andrea

-- 
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.

_______________________________________________
Geoserver-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/geoserver-devel
