Hi Andrea, all,

first off, sorry it took me so long to reply to this important thread, and
thanks again, Andrea, for splitting the discussion out into separate topics.

On Sat, Apr 28, 2012 at 10:47 AM, Andrea Aime <[email protected]> wrote:
> Hum... the thread is getting long and mails deal with many topics, let me
> try to split this into separate sub-threads.
> This one is about filters, paging and sorting.
>
> About sorting I believe we are all on the same page, my suggestions
> about checking for fast sorting was just a random idea anyways,
> don't see any strong need to see it implemented.

Yep, I already decided to get rid of Catalog.canSort() based on feedback.

You make some good points; let's go to the most important one first,
Predicate vs OGC Filter.

To start with, I am willing to concede on using OGC Filter instead of the
"domain specific" Predicate interface. I have a couple of usability concerns
though; more about this later.

I do know it is a general purpose filter predicate language, coming from
GeoTools, which we already depend on. When I mention it as a dependency I
mean at the architectural level, not at the classpath level. I also know
from experience that our Filter implementation works on non-feature object
models, given appropriate property extractors. If we go that route, we'll
probably need a Catalog domain specific (i.e., *Info) property extractor.

I do understand OGC Filter is richer and more expressive than the
(purposely) limited number of "well known" Predicate idioms. More about the
rationale later.

Your argument about the ease of encoding Filter because we already have code
to do that applies, AFAIK, to a specific data model (flat table), at least in
practice, unless you want to use app-schema. It is not a bad argument, it
just falls on the implementation detail side of the fence. I assume in good
faith that you value the decoupling of the interface from the underlying data
model, so nothing to add here.

About the availability of filter splitters, that's indeed a very nice bonus
point for the OGC Filter choice.

The argument about the lack of spatial filtering in Predicate is debatable.
The reason there are no well known spatial predicates in the proposal is that
the set of proposed well known constructs is strictly limited to fit the
current use of Catalog, that is: equals, "iLike", and, or, exists, isNull.
Nothing prevents adding spatial constructs once the real need for them
arises, but if you want them executed by the storage engine, that also means
imposing a new requirement on the catalog backend, namely support for spatial
queries, which is currently not required. If, on the other hand, you want
spatial filters without requiring the backend to support spatial queries,
then it's just as easy to create a predicate as an anonymous inner class.

To keep this as short as possible: a possible CSW implementation on top of
the GeoServer Catalog, the ability to specify a CQL filter as a
GetCapabilities parameter, and the ability to wrap the Catalog as a DataStore
and hence draw maps of where the layers are and expose that information
through WFS, are all really nice ideas. I want to make clear I'm not against
any of those, and I actually encourage that kind of feedback whenever a new
proposal is made, since that's how we as a community ensure our product
serves everyone's needs.

On the other hand, the best way we know so far to make things happen is to
stick to an iterative and incremental development process. The focus of
GSIP69 is to solve today's GeoServer scalability problems. Nonetheless, this
kind of feedback is valuable in terms of planning ahead for extensibility, so
thanks for bringing up those points.

That said, I don't really see any of those new feature ideas as blockers for
having domain specific query constructs for the Catalog. I think it's
pointless to go too deep into each of them right now, but I acknowledge that
the existing infrastructure for working with OGC Filters would be convenient:
translating CQL to Predicate, although not hard at all, does look like
unnecessary duplication.

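Going back to the point about spatial filters as anonymous inner classes, a
minimal sketch of what I mean follows. Everything in it is hypothetical and
made up for illustration (PredicateSketch, Layer, NameContains, intersects,
filter); it is not the API in the proposal, and a real catalog object is of
course not a name plus a bounding box. It only shows a "well known" construct
combined with an ad-hoc spatial test that is evaluated in-process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PredicateSketch {

    /** Hypothetical stand-in for the proposal's Predicate interface. */
    public interface Predicate<T> {
        boolean apply(T input);
    }

    /** Hypothetical, heavily simplified catalog object: a name plus a bounding box. */
    public static final class Layer {
        public final String name;
        public final double minX, minY, maxX, maxY;

        public Layer(String name, double minX, double minY, double maxX, double maxY) {
            this.name = name;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    /** A "well known" construct: a named class that a backend could recognise and translate. */
    public static final class NameContains implements Predicate<Layer> {
        public final String text;

        public NameContains(String text) {
            this.text = text;
        }

        @Override
        public boolean apply(Layer layer) {
            // case insensitive match, in the spirit of "iLike"
            return layer.name.toLowerCase().contains(text.toLowerCase());
        }
    }

    /** Logical conjunction of two predicates. */
    public static <T> Predicate<T> and(final Predicate<T> left, final Predicate<T> right) {
        return new Predicate<T>() {
            @Override
            public boolean apply(T input) {
                return left.apply(input) && right.apply(input);
            }
        };
    }

    /** Ad-hoc spatial predicate: opaque to the backend, so it runs in-process. */
    public static Predicate<Layer> intersects(final double minX, final double minY,
                                              final double maxX, final double maxY) {
        return new Predicate<Layer>() {
            @Override
            public boolean apply(Layer layer) {
                // two boxes intersect when they overlap on both axes
                return layer.minX <= maxX && layer.maxX >= minX
                        && layer.minY <= maxY && layer.maxY >= minY;
            }
        };
    }

    /** Naive in-process evaluation; a real backend would push down what it can encode. */
    public static <T> List<T> filter(Iterable<T> items, Predicate<T> predicate) {
        List<T> result = new ArrayList<T>();
        for (T item : items) {
            if (predicate.apply(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Layer> layers = Arrays.asList(
                new Layer("roads_eu", 0, 40, 10, 50),
                new Layer("roads_us", -100, 30, -90, 40),
                new Layer("rivers_eu", 5, 45, 15, 55));
        Predicate<Layer> query = and(new NameContains("ROADS"), intersects(8, 44, 12, 48));
        for (Layer hit : filter(layers, query)) {
            System.out.println(hit.name);
        }
    }
}
```

The design point is only that the spatial test needs no registration anywhere
and no spatial support in the backend, at the price of being evaluated in
memory.
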
So, down to the core of this discussion: what we're essentially discussing is
a design decision. I am sorry it seemed like the one I made was taken
lightly. It was not, which doesn't mean it was the right one either. I
seriously considered our already available OGC Filter implementation as the
first choice, as it seems to fit naturally, or at least easily.

Please note though that the proposal is in "under discussion" status, so
that's exactly what should be happening. Hence I'm glad we're doing so, and
I'd like to see more of this kind of discussion, as I often find myself
looking at bad-smelling pieces of code throughout our codebase. Including, of
course, my own code.

All that said, I think it's time to weigh both options so we can make a
decision knowing what the benefits and drawbacks of each one are.

The following is _my_ current thinking on both approaches. Feel free to
complete/correct it with your own. I know everything is debatable; I'm just
trying to figure out a sensible set of pros and cons so that we can make an
informed decision.

Option #1, use OGC Filter as the Catalog query model
----------------------------------------------------

Rationale: convenience given the existing infrastructure, familiarity, high
expressiveness, ability to create complex filters.

Benefits
========

- Code reuse: GeoTools' OGC Filter implementation is in wide use by the Data
  Access APIs, meaning there are already a lot of utilities to deal with
  filters, like filter splitters.
- Familiarity: most developers who work with the GeoServer Catalog are
  probably already familiar with the GeoTools Filter API.
- Wide range of ready to use filter constructs: almost everything you can
  translate to a SQL where clause is in there.

Caveats
=======

- Difficult extensibility. The way to extend filter functionality is by
  creating custom Functions. There are cases where the required filter can't
  be expressed using the prescribed OGC Filter idioms.

    FilterFactory ff = CommonFactoryFinder.getFilterFactory(null);
    Filter enabledLayers = ff.equals(ff.property("enabled"), ff.literal(Boolean.TRUE));
    Filter brokenLayers = ff.and(enabledLayers, ???);

  Here we want brokenLayers to be a filter that returns all layers that are
  enabled by configuration but broken or disabled by cascading, using the
  derived enabled() property. This is not possible without registering a
  function factory with a function specific to that purpose. But doing so
  would make the function available globally, whereas it's domain specific.
  If, on the other hand, you simply get the layers enabled by configuration
  (i.e. LayerInfo.isEnabled()) and then iterate over them in client code,
  checking the derived enabled() property of every one of them, you lose the
  ability to execute the predicate further down the chain, unnecessarily
  making all the Catalog wrappers create wrapper objects for them. Any
  predicate that is not natively encodable suffers from this issue.

  Another example is the security filtering in SecureCatalogImpl. This
  filtering is based on externally configured constraints that may or may not
  be translatable to a "natively encodable" query predicate. Also, the logic
  may change over time. The proposed approach builds a custom predicate that
  is evaluated in-process, with the important characteristic of being pushed
  down so that it is evaluated before results are returned to the calling
  code. So even if it's not "natively encodable", it still has the intended
  effect of being processed at the bottom of the call chain, sparing every
  catalog wrapper (including SecureCatalogImpl) from creating object wrappers
  for results that are then discarded, hence lowering memory consumption and
  GC overhead.

- Deviation from simple property filtering: some filters are not so
  straightforward. For example, querying for simple properties of multivalued
  properties would require some sort of XPath syntax. How well frameworks
  like JXPath fit into our object model is yet to be assessed.
  We have static and dynamic properties (the latter through MetadataMap).
  Using a custom property extractor that changes the regular OGC Filter
  property addressing to fit our desire for simplicity would lead to
  confusion, whilst with our own query model we can make it straightforward.
  A filter like "styles.id = 'someid'" in the proposed query model is as
  simple as that, an OGC Filter would ma

- Pandora's box: there's so much functionality in OGC Filter that has not
  been proven against the Catalog object model that ensuring proper
  functioning of each and every possible filter construct may require a
  significant effort, opening the door to random bug reports over untested
  code paths.
- Implementation constraints: it is considered a desirable feature for the
  API to impose as few constraints on implementations as possible. The usage
  of Catalog, and hence its query needs, are so far rather simple: find by
  id, find by name, and little more. Using OGC Filter either forces the
  backend to be able to handle a lot of filter constructs that might never
  see real use, or rather executing them mostly in-process, that the

Option #2, define a GeoServer Catalog's own query model
-------------------------------------------------------

Rationale: architectural consistency, cohesion, ease of use, extensibility.

Benefits
========

- Easy extensibility: no need to register extra factories, just simple
  anonymous inner classes to create ad-hoc predicates. Example:

    Predicate<LayerInfo> enabledLayers = propertyEquals("enabled", true);
    Predicate<LayerInfo> brokenLayers = and(enabledLayers, new Predicate<LayerInfo>() {
        @Override
        public boolean apply(LayerInfo layer) {
            // note the use of the derived enabled() property instead of the
            // POJO isEnabled() one; negated, since we want the broken layers
            return !layer.enabled();
        }
    });

- Ease of use: avoid casting everywhere. We're working with CatalogInfo and
  derivatives, so let's make use of modern language constructs.
- Scope and feature creep containment[1]: by limiting the number of well
  known predicates to the indispensable minimum we stay in control of what
  can be done and how. We thereby try to impose as few implementation
  constraints as possible, and keep the focus on what the Catalog is for.
  Nothing prevents new features from being built that use or are based on the
  Catalog objects. But adding too many features and increasing the scope
  "just in case" can lead to unnecessary complexity and hurt maintainability.
  Another example of this is the recent move of GeoWebCache's tile layer
  configuration out of the Catalog objects' metadata map: initially it seemed
  convenient to (ab)use the LayerInfo and LayerGroupInfo metadata maps to
  hold the related tile layer configuration. As the complexity of the GWC
  integration configuration grew, the approach started to show its drawbacks:
  catalog object configuration flooding, lack of a quick way to get only the
  layers that do have an associated tile layer, and more. Moving away from
  that model and having the integrated GWC maintain its own set of
  configuration objects, although still depending on the available GeoServer
  catalog layers, eliminated the added complexity in the catalog
  configuration and avoided having to extend or modify it just to serve an
  orthogonal concern.
- Fewer implementation constraints: it was and still is a driving principle
  to impose as few implementation constraints on catalog backends as
  possible. After all, the proposal targets catalog scalability, and there's
  more than one way of skinning a cat.

[1] http://en.wikipedia.org/wiki/Scope_creep
    http://en.wikipedia.org/wiki/Feature_creep

Caveats
=======

- Yet another query predicate: although it's meant to be really
  straightforward (there's not much to debate about the meaning of equals,
  isNull, contains (a.k.a. iLike), and, and or), we need to recognize it is
  just yet another query predicate "language".
- Limited set of well known predicate idioms: although the proponent thinks
  of this as a feature and not a bug, an argument can be made the other way
  around, depending on which characteristics you value the most, and where
  you want to draw the application boundaries architecture wise.

------------------------------

All this said, I reiterate this is a design decision I'm willing to concede
on and switch to OGC Filter. The only real blocker IMHO is the inability to
easily extend in place: we'd either have to use Catalog specific function
factories, flooding the general purpose filters with catalog specific ones,
or give up executing predicates on the backend and be forced to iterate (over
a lot of objects) in place, apply the in-process filtering in client code,
and be exposed to unnecessary wrapping from catalog decorators.

Some other random thoughts inline.

>
> About the topic of the predicate API being simpler than the OGC one... yes,
> it has less filters, which actually makes it less useful.

My position is that it makes it as useful as needed.

> About it being harder to build filters, I don't see it.
> The filter building styles of both goes through a factory with short
> named methods and arguably OGC allows for CQL expressions to
> be used if that is perceived to be simpler.

Hmmm... at first glance, yes, it looks simpler. And indeed CQL is a nice,
terse way of creating a Filter in test cases. When it comes to actual
application code, where the inputs are obtained dynamically from user input,
it's not so. You'd need to deal with string concatenation and proper
parameter escaping to make sure the resulting CQL is well formed.

>
> I don't see many people implementing lots of catalog subsystem
> implementations, and those people will likely deal with GeoServer in
> other ways so they will have a passing familiarity with OGC filter
> concepts already.
> Having to learn another API actually makes things more confusing, you
> need to remember what each API does and how.

I see your point, but I think it has little applicability. We're not changing
the meaning of equals, isNull, exists, contains (we can rename it to iLike if
that looks more familiar), and, and or.

>
> Encoding wise, we already have a lot of code that allows to split filter
> and encode them in SQL and other languages, which means we have examples on
> how things are done. Filter splitters do not make any use of the feature
> type, they only know about the filter types listed in a filter capabilities
> object (and they can be subclassed to allow more targeted checks),
> and filter encoders are something you need to prime often with a feature
> type, while in this case you'll have to prime them with the bean class.
> I don't see the difference nor the difficulty.
>
> Lack of spatial filter is problematic because of possible
> CSW implementations, but also for security subsystems that often
> express spatial constraints on the data (and layers) you can actually
> see, not having an efficient way to make them run looks like a serious
> drawback to me.
>
> But also think about the case in which you are not doing multitenancy,
> but you do have tons of layers. I know of one installation in Italy at
> a research center that had, one year ago, 160k layers registered,
> with new ones showing up every day.
> Think how useful it would be for a case like this to be able to pass down
> a CQL filter on the GetCapabilities to get a more focused capabilities
> document for WMS/WFS/WCS usage.
>
> About the GUI filter that is not implementable efficiently in Predicate,
> I did not notice "contains" is a well known filter, so sorry about that one,
> reviewing the whole work in just half a day means I could not actually read
> everything line by line (and often I had difficulties understanding what the
> code did, see also the other mail about catalog implementations, but
> generally speaking the proposal was rich in terms of describing api and
> architectural concepts and poor in terms of describing how things are done,
> which is equally important for something that aims at being committed).

Good advice. I'll try to point out how things are implemented in the proposal
wherever it's appropriate. A point can be made that the community module is
not strictly part of the proposal but rather a means of validating the API,
as we're not proposing to replace the default catalog with it. But in any
case I see what you mean, and it seems valuable feedback to me.

Cheers,
Gabriel.

>
> About the relationship between catalog and data stores I don't want in
> any way impose it, nor I want to have abstraction layers be broken,
> I simply recognize OGC filter as a generic data access filtering API
> while it seems that you see it as something specific to GeoTools... but
> it's not, it's not meant to be, using property extractors you can actually
> have it filter whatever, features, beans, hashmaps, spatial stuff and
> non spatial one.
>
> The thing could then go two ways. An easy nice to have is to
> be able to build a store on top of the catalog that would allow
> to display WMS maps of where the layers are, and search over
> the catalog via simple WFS (which can be a nice way to allow
> someone that wants to search into the server without having to
> build or use a CSW client). In both cases having spatial filters would
> be really handy, but in general the richness of OGC filter would
> allow for complex searches to be made fast.
>
> The other direction is to be able to build a catalog the other way, that is,
> build it on top of a data store. Now, I'm not sure it would be great, and
> we may not want to use it, but it would likely make for a quick to
> implement spatially searchable catalog.
> (mind, this is not the strong argument, I'm actually just thinking out loud,
> the strong argument is that the filter API is good, tested, well known,
> rich and flexible, Predicate is none of that, the rest of the arguments
> are just topping on the cake).
> Feature types could be created by flattening the objects and creating
> the feature type by reflection. Of course this would break the moment
> new attributes show up, but we can call updateSchema() to add those, right?
> It would also make for a more "relational" setup on DBMS storage, which
> many people would feel more comfortable with, and would make it easier
> for other applications to directly edit the persisted catalog (something I'm
> sure many people will want to do).
>
> Cheers
> Andrea

-- 
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.

_______________________________________________
Geoserver-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/geoserver-devel
