On 14.01.2013 at 16:14, Neal R Lewis <[email protected]> wrote:

>> The way you put it, it appears that XMI provides for fine-grained queries
>> while the binary CAS does not. However, there is no support for fine-grained
>> access in either format (deliberately ignoring that XMI is an XML format
>> and could be stored in an XML database providing fine-grained access).
>
>> Are there additional requirements hidden in the FSID format? I could imagine:
>
>> - ability to get all FSes produced by a certain annotator across all CASes
>>   in all collections or in a certain collection
>
>> - ability to get all CASes in a collection
>
>> - ability to get all CASes
>
>> Where do you get the annotatorId from? I see no sensible way that the UIMA
>> framework can provide such an ID. There is also a conflict potential.
>> Consider if analysis engine A creates an FS and analysis engine B updates a
>> primitive feature in that FS. Assuming that primitives do not get an FSID
>> since they are not FSes, should the annotatorID of the FS be updated to B or
>> should it remain A?
>
> We currently assign our own annotator IDs. Annotator IDs are meant to
> distinguish between different UIMA applications, not individual AEs in an
> aggregate or within the same pipeline. I meant to use the term AnalyticID,
> which is more precise, so I'll use that from now on. In the strictest sense,
> you can imagine different AnalyticIDs for different PEARs written by
> separate developers for separate annotators, but run on the same CAS.
> Now, a conflict might occur if they both had the same Type System and
> annotated the same CAS. This would result in a duplicate annotation for that
> type, but not for an FS, because the FS would have a different FSID
> associated with it.
>
> So, if only one UIMA application is performing the analysis, then the
> AnalyticID can remain stable throughout operations, perhaps with a default
> value if only one UIMA application is run.
>
> The benefit of having a CAS store goes beyond archiving information: it
> also lets us track a CAS's trajectory through analytics. It allows analytics
> to be developed separately. When a new analytic is developed (and here, I
> mean a new PEAR that will most likely run in a new JVM), we can run it on an
> old collection of CASes. If an analytic is small and only needs one or two
> objects from the CAS, then we can reduce message size by retrieving only
> those objects which are necessary. FSIDs and XMI serialization allow us to
> do this.
>
> With FSIDs, it is possible to query for a particular element or groups of
> elements, or even multiple CASes. Some queries are more complex than others
> (like getting all annotations from a particular annotator across CASes) but
> still manageable. I'll try to illustrate with an example from a
> hypothetical deserialized CAS:
>
> The following CAS was queried for an fsid with a Collection ID of 3, an
> artifact ID of 15, and an AnalyticID of 6000. The 6000 analytic looked for
> sentences like "lvef of 30%" and annotated the sentence and the value of the
> LVEF (ignore the cop attribute for now; we use it to track provenance):
>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
> xmlns:cas="http:///uima/cas.ecore"
> xmlns:cdts="http:///org/test/health/cdts.ecore"
> xmlns:tcas="http:///uima/tcas.ecore" xmi:version="2.0">
> <cdts:LVEF begin="449" value="20-25%" cop=".3.15.1000.1" end="490"
> fsid=".3.15.6000.1" sofa="1" xmi:id="13"/>
> </xmi:XMI>
>
> This CAS fragment is given what we call a "transient view" or projection
> during a preprocessing step before it runs through a PEAR in UIMAj (to which
> we externally assign an AnalyticID of 6003). The PEAR looks for the value of
> the LVEF and maps it to a term; the result is filtered for only the new
> objects in the CAS and then put back into the store, where the store writes
> in a new fsid:
>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
> xmlns:cas="http:///uima/cas.ecore"
> xmlns:cdts="http:///org/test/health/cdts.ecore"
> xmlns:tcas="http:///uima/tcas.ecore" xmi:version="2.0">
> <cdts:SeverelyDepressedLVEF begin="449" time="2001-12-31T12:00:00"
> value="20-25%" cop=".3.15.6000.1" end="490" fsid=".3.15.6003.1" sofa="1"
> xmi:id="13"/>
> </xmi:XMI>
>
> This is what I mean by fine-grained queries using FSID. We can also imagine
> a coarse query for Collection 3, CAS 15 (fsid like '.3.15.%'), which will
> produce a full CAS.
>
> Does this answer some of your questions?
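
To make sure I read the FSID layout correctly, here is a minimal Python sketch of the two kinds of query described above. It assumes an fsid has the form `.collection.artifact.analytic.n`; the meaning of the last field is not stated in the message, so calling it `serial` is my guess. The names `Fsid`, `parse_fsid` and `select_fs` are hypothetical, and putting both example elements into one XMI document (with distinct `xmi:id` values) is only for illustration. A real store would presumably answer such queries itself rather than scan the XMI.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Fsid(NamedTuple):
    collection: int
    artifact: int
    analytic: int
    serial: int  # assumed meaning of the last field


def parse_fsid(text: str) -> Fsid:
    # ".3.15.6000.1" splits into an empty string plus four integer fields.
    parts = text.split(".")
    if len(parts) != 5 or parts[0] != "":
        raise ValueError(f"unexpected fsid: {text!r}")
    return Fsid(*(int(p) for p in parts[1:]))


def select_fs(xmi: str, collection: int, artifact: int,
              analytic: Optional[int] = None):
    """Return (tag, Fsid) pairs for matching top-level elements.

    With analytic=None this behaves like the coarse "fsid like '.3.15.%'"
    query; with an analytic given it is the fine-grained variant.
    """
    hits = []
    for elem in ET.fromstring(xmi):
        raw = elem.get("fsid")
        if raw is None:
            continue  # element carries no fsid attribute
        fsid = parse_fsid(raw)
        if (fsid.collection, fsid.artifact) != (collection, artifact):
            continue
        if analytic is not None and fsid.analytic != analytic:
            continue
        hits.append((elem.tag, fsid))
    return hits


# Illustrative document combining the two elements from the examples above.
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cdts="http:///org/test/health/cdts.ecore" xmi:version="2.0">
  <cdts:LVEF begin="449" value="20-25%" cop=".3.15.1000.1" end="490"
      fsid=".3.15.6000.1" sofa="1" xmi:id="13"/>
  <cdts:SeverelyDepressedLVEF begin="449" time="2001-12-31T12:00:00"
      value="20-25%" cop=".3.15.6000.1" end="490" fsid=".3.15.6003.1"
      sofa="1" xmi:id="14"/>
</xmi:XMI>"""

fine = select_fs(SAMPLE, 3, 15, analytic=6000)   # only the LVEF element
coarse = select_fs(SAMPLE, 3, 15)                # both elements
```

Comparing parsed integer fields rather than raw strings keeps the match exact per field, at the cost of rejecting any fsid that does not follow the assumed four-field layout.
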
Thanks, that answers it, I think. I'll try to sum up the aspects related to
the specification of a CASStore to make sure I understood it correctly:

- The AnalysisId is basically just another FS that can be queried for.
- It is supplied by the application using the CASStore and it is opaque
  to the CASStore.
- The AnalysisId may contain IDs that the application obtains from the
  CASStore, such as the CAS ID and possibly the Collection ID, although
  this is not necessary.
- The CASStore would still work fine, even if a CAS did not contain any
  AnalysisIds.

So supporting this AnalysisId is a technique an application would use rather
than a feature of the CASStore.

Is this a correct interpretation?

Best,

-- Richard

-- 
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected]
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
