Hi Johann,

First off, welcome to the list!! :)

More comments below:

On 1/21/13 2:13 AM, "johann sorel" <[email protected]> wrote:

>Hello everyone,
>
>Sorry for the late answer, I wasn't yet registered on this mailing list.
>Here is a quick introduction since Martin already talked about me:
>I'm Johann Sorel from the same company and working on the geotoolkit
>project too. I mainly work on data readers/writers, rendering engines and
>Swing user interfaces, but also a bit on everything:
>metadata, coverage, security, web services.
>
>I have been looking at the Tika project. I have never used it, so correct
>me if I say something wrong.
>From what I see it is limited to metadata reading only and reduced to
>file types.
>Writing is also something the Apache SIS project should provide, so I
>believe SIS should have a higher-level API that Tika could implement.

Yep indeed. Tika doesn't provide facilities for writing files -- it's an
API and implementation for:

* automatic file identification/detection and classification based on
  IANA std mime-types and others
* mime-repository
* integration of existing parsing toolkits to extract metadata and/or
  text
* language identification

>
>About data source, I propose a different approach: Java Content
>Repository version 2 (JCR) specification (JSR 170 and 283).
>A possible implementation is Apache JackRabbit:
>http://jackrabbit.apache.org

Yep, I know Jukka, who used to be their VP, and have followed their
development. It's a great product.

>While Tika might be interesting for metadata, the JCR specification
>defines APIs for reading, writing and queries.
>Besides, the community using JCR is far larger than Tika or GDAL, to name
>some of them: LifeRay, Exoplatform, Oracle beehive, Hippo CMS, ...

I have to say I'm not sure that the community using JCR is larger than
that of Tika or GDAL -- which themselves have pretty wide adoption in
some of those same systems.

>Reusing the same or a similar model would simplify the integration of
>the SIS model in existing applications,
>and we would benefit from the expertise already built into this
>specification.
>The JCR model is very similar to features: it has Nodes and NodeTypes,
>which I believe might be usable for metadata too.

Using the JCR may help us integrate better into some of the applications
we want to target, for sure.

>
>Filter would be placed just before data source, since it should have a
>query API which uses filters.
>
>If I can give a global view of the solution we have so far:
>(I won't talk about referencing, Martin has much more knowledge than me
>on this topic)
>
>1) we have 3 base storage atoms: Metadata, Feature (and underneath
>   Geometry), Coverage
>   --> defined by several OGC/ISO specifications
>2) to interrogate them we can use: Filter, Expression, Query
>   --> defined by OGC (exist in geoapi-pending)
>   Query --> defined in JCR
>3) to manage/query/analyze them: Repository/DataSource/DataStore
>   --> can be based on JCR, GDAL, Tika models or a mix
>4) to render the data: style model, Map model
>   --> can be OGC SLD/SE (exist in geoapi-pending), could also be some
>       kind of CSS,
>   --> the map model could be OGC WMC, but this spec is limited to web;
>       it would require some improvements.

This sounds great to me. I'd be happy also to figure out where Tika fits
-- probably in the Metadata model.

>
>Some of those solutions are already implemented and have been properly
>separated into interfaces (geoapi-pending) and implementations
>(geotoolkit-pending), so they could be used as a starting point.

Great, looking forward to it! Please feel free to file some JIRA issues,
and to get started! We'd welcome you here in the SIS community!
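
To check that I read points 1) to 3) of your global view correctly, here
is a minimal Java sketch of that layering: a storage atom, a filter, and
a data store whose query API takes the filter. Every name in it (Feature,
Filter, DataStore, InMemoryStore, bbox) is made up for illustration; none
of this is the GeoAPI, Geotoolkit, JCR or Tika API, and the in-memory
store only stands in for whatever backend we end up plugging in.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class LayeringSketch {

    /** Hypothetical "storage atom" (point 1); not the GeoAPI Feature type. */
    static final class Feature {
        final String name;
        final double x;
        final double y;

        Feature(String name, double x, double y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical filter (point 2): just a predicate over features. */
    interface Filter extends Predicate<Feature> {}

    /** Hypothetical data store (point 3): its query API accepts a filter. */
    interface DataStore {
        List<Feature> query(Filter filter);
    }

    /** In-memory stand-in; a JCR-, GDAL- or Tika-backed store could implement the same interface. */
    static final class InMemoryStore implements DataStore {
        private final List<Feature> features;

        InMemoryStore(List<Feature> features) {
            this.features = List.copyOf(features);
        }

        @Override
        public List<Feature> query(Filter filter) {
            // Keep only the features accepted by the filter.
            return features.stream().filter(filter).collect(Collectors.toList());
        }
    }

    /** Builds a bounding-box filter (bounds inclusive). */
    static Filter bbox(double minX, double minY, double maxX, double maxY) {
        return f -> f.x >= minX && f.x <= maxX && f.y >= minY && f.y <= maxY;
    }

    public static void main(String[] args) {
        DataStore store = new InMemoryStore(List.of(
                new Feature("Paris", 2.35, 48.85),
                new Feature("Pasadena", -118.14, 34.15)));
        // Query the store with a rough bounding box over western Europe.
        for (Feature f : store.query(bbox(-10, 35, 30, 60))) {
            System.out.println(f.name);
        }
    }
}
```

The point of the sketch is only the direction of the dependency: the
store knows about filters, the filter knows about features, and nothing
upstream knows about the store.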

Cheers,
Chris

>
>
>Johann Sorel
>Geomatys
>
>
>-------------------------------------------------------------------------------
>Hey Martin,
>
>On 1/18/13 12:12 PM, "Martin Desruisseaux"
><[email protected]> wrote:
>
> >On 18/01/13 11:31, Adam Estrada wrote:
> >> Spot on with Tika being an SIS dependency, Martin! The idea is to be
> >> able to extract content from as many file formats as possible based
> >> on their MIME types. GDAL provides the interface to a lot more
> >> geospatial formats.
> >
> >We have the notion of a "data source" interface (not yet committed), and
> >Tika or GDAL can be one of them. GeoTIFF, NetCDF, etc. are other data
> >sources (we have some extra flexibility if we read NetCDF files directly
> >rather than through GDAL for instance, but we would do that only for the
> >most important formats rather than duplicating the totality of GDAL).
> >However "data sources" appear downstream relative to metadata and other
> >basic modules. A list of modules in approximate dependency order could
> >be:
> >
> >  - utility
> >  - metadata
> >  - referencing
> >  - geometry
> >  - feature
> >  - coverage
> >  - data source <-- Tika/GDAL can be plugged here
> >  - styles
> >  - renderer
>
>+1 that makes sense to me.
>
>Note I also believe there is another dependency from Tika to SIS
>(especially for the WKT parsing).
>
> >I'm not sure if "filter" would be before or after "data source" - Johann
> >Sorel would know better (I think he is watching this list, even if he
> >hasn't sent emails yet).
>
>Come on Johann, come out and say hi! :)
>
> >Actually the "sis-metadata" module being built is not about arbitrary
> >metadata, but rather about the "lingua franca" to be used in SIS for
> >metadata. Many metadata models could be chosen for this purpose, but the
> >proposed SIS approach is to select ISO standards as the lingua franca.
> >All other sources of metadata would need to be converted to ISO 19115
> >before being used in a source-independent way by all SIS modules. This
> >is the purpose for instance of the NetCDF - ISO mapping mentioned in
> >the previous email. This explains why "data source", which is where
> >input/output happens, is so far away from metadata in the above
> >dependency chain; all preceding modules define the models which will
> >represent the data read by the data sources.
>
>It would be great to use Tika to convert *insert format here* to ISO 19115
>if possible.
>
> >Obviously the XML (un)marshalling is an exception to what I just said,
> >since it is defined straight in the core metadata module rather than as
> >a data source. But we should have (I hope) few such exceptions. This
> >exception exists for two reasons: 1) as a side effect of the way JAXB
> >works (annotations straight in the source code), and 2) because while
> >ISO 19115 would be the "lingua franca" for the conceptual model, XML is
> >the "lingua franca" for the file format at least at OGC/ISO/INSPIRE, so
> >maybe it deserves that special treatment...
>
>+1.
>
>Cheers,
>Chris
>
> > Martin
