> On Jul 24, 2015, at 8:08 AM, Joern Kottmann <[email protected]> wrote:
>
> On Fri, Jul 24, 2015 at 12:46 PM, Richard Eckart de Castilho <[email protected]> wrote:
>
>> On 23.07.2015, at 19:17, Joern Kottmann <[email protected]> wrote:
>>
>>>> If this is the scenario, another option would be to have the serialized CASes
>>>> stored along with a reference to their type system, and have some new
>>>> deserialization capability be able to locate the referred-to type system along
>>>> with the CAS to be read in. Would that "solve" this issue, or are there other
>>>> aspects?
>>
>> https://issues.apache.org/jira/browse/UIMA-2127 ;)
>>
>> But having the TS stored alongside the CAS also is nice - see below.
>>
>>> It would probably solve it, but it is not a simple solution either. That
>>> would mean that the Type System get switched frequently and have be
>>> looked up all the time.
>>
>> For DKPro Core, I have implemented a BinaryCasWriter that stores the type
>> system in the same file as the binary serialized CAS. It is not always the
>> best solution because it adds a fixed overhead to every file, but it is
>> very convenient. Optionally, the type system can be stored externally in a
>> separate file to avoid this overhead. If and how this typesystem can be
>> used depends on which of the six kinds of binary serialization is being
>> used. See [1] for an overview over these formats and their properties.
>>
>
> We have a few hundred million documents in the system, storing the ts with
> each document would be wasteful. It needs storage and it has to be parsed
> for each CAS.
>
>> In the BinaryCasReader, depending on the type of serialization, either:
>> - there is a failure if the pipeline CAS typesystem is not compatible with
>>   the persisted CAS;
>> - the type system in the pipeline CAS is reinitialized from the persisted
>>   CAS;
>> - the data from the persisted CAS is loaded leniently, dropping all FSes
>>   that are not defined in the pipeline CAS typesystem
>>
>> Furthermore, the BinaryCasReader auto-detects the binary format and loads
>> it, be it the Java serialization-based format or one of the binary formats
>> that Marschall recently created, or our extended format that also embeds
>> the typesystem in the file.
>>
>> Mind that depending on the use-case a different kind of serialization may
>> be appropriate.
>>
>> For me, this covers in particular the following use-cases:
>>
>> - fast (de)serialization of the entire CAS
>> - compact binary format (some more some less)
>> - stable FS addresses (in some formats)
>> - restoring the pipeline CAS type system from file (i.e. CAS can be
>>   initialized with an empty type system on creation and TS is set by reader -
>>   in some formats)
>> - lenient loading of data allowing for different TSes on disk and in
>>   pipeline (in some formats)
>>
>> Would such an approach cover (some of your) use-cases?
>>
>
> With the current design the best option is probably to store a type system
> id with the document.
Agreed with this, where the type system ID is a URI.

> It would be nice to avoid that additional complexity.
>
> I think I have mainly two cases I can't really deal with:
> - A CAS contains FSes of many types. I know a few of those types and would
>   like to only work with them. Not interested at all in the FSes with other
>   types.
> - A CAS contains FSes of many types. I just want to deal with them as if
>   they have a certain super-type. That could be FeatureStructure or
>   AnnotationFS.
>
> The CASes above have been produced by many different AAEs with similar, but
> slightly different type systems.

Right, those are typical issues that people commonly need to overcome,
particularly the second, where the super-type is some relatively generic
type (e.g., Token).

..m
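
To make the type-system-ID idea concrete, here is a rough sketch of how a URI-keyed cache could avoid re-parsing the type system for every CAS. It is only an illustration under assumptions: `TypeSystemCache`, its loader function, and the `urn:example:...` ID are hypothetical names, not part of UIMA, and the type parameter `T` stands in for whatever parsed type system object the application actually uses.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical helper (not a UIMA class): caches parsed type systems by URI
 * so that each distinct type system is loaded once, not once per document.
 */
final class TypeSystemCache<T> {
    private final Map<URI, T> cache = new ConcurrentHashMap<>();
    private final Function<URI, T> loader;

    /** The loader resolves an ID to a parsed type system; it is an assumption of this sketch. */
    TypeSystemCache(Function<URI, T> loader) {
        this.loader = loader;
    }

    /** Returns the cached entry, invoking the loader only on first use of an ID. */
    T get(URI typeSystemId) {
        return cache.computeIfAbsent(typeSystemId, loader);
    }

    /** Number of distinct type systems loaded so far. */
    int size() {
        return cache.size();
    }
}

class TypeSystemCacheDemo {
    public static void main(String[] args) {
        AtomicInteger parses = new AtomicInteger();
        // Stand-in loader: a real one would fetch and parse the descriptor behind the URI.
        TypeSystemCache<String> cache = new TypeSystemCache<>(id -> {
            parses.incrementAndGet();
            return "parsed:" + id;
        });

        URI id = URI.create("urn:example:ts:news-v1"); // made-up ID stored with each document
        for (int doc = 0; doc < 1000; doc++) {
            cache.get(id); // every document refers to the same type system ID
        }
        System.out.println("documents=1000 parses=" + parses.get());
    }
}
```

With this shape, each document stores only the URI, and the parse cost is paid once per distinct type system rather than once per CAS. It does nothing for the two cases quoted above, where the pipeline knows only some of the types or only a super-type.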
