Re: Ideas for UIMA v3

Michael Tanenblatt Fri, 24 Jul 2015 05:24:09 -0700

> On Jul 24, 2015, at 8:08 AM, Joern Kottmann <[email protected]> wrote:
> 
> On Fri, Jul 24, 2015 at 12:46 PM, Richard Eckart de Castilho <[email protected]> wrote:
> 
>> On 23.07.2015, at 19:17, Joern Kottmann <[email protected]> wrote:
>> 
>>>> If this is the scenario, another option would be to have the serialized
>>>> CASes stored along with a reference to their type system, and have some
>>>> new deserialization capability be able to locate the referred-to type
>>>> system along with the CAS to be read in.  Would that "solve" this issue,
>>>> or are there other aspects?
>> 
>> https://issues.apache.org/jira/browse/UIMA-2127 ;)
>> 
>> But having the TS stored alongside the CAS also is nice - see below.
>> 
>>> It would probably solve it, but it is not a simple solution either. That
>>> would mean that the type system gets switched frequently and has to be
>>> looked up all the time.
>> 
>> For DKPro Core, I have implemented a BinaryCasWriter that stores the type
>> system in the same file as the binary serialized CAS. It is not always the
>> best solution because it adds a fixed overhead to every file, but it is
>> very convenient. Optionally, the type system can be stored externally in a
>> separate file to avoid this overhead. If and how this typesystem can be
>> used depends on which of the six kinds of binary serialization is being
>> used. See [1] for an overview of these formats and their properties.
>> 
>> 
> We have a few hundred million documents in the system, storing the ts with
> each document would be wasteful. It needs storage and it has to be parsed
> for each CAS.
> 
> 
> 
>> In the BinaryCasReader, depending on the type of serialization, either:
>> - there is a failure if the pipeline CAS typesystem is not compatible with
>> the persisted CAS;
>> - the type system in the pipeline CAS is reinitialized from the persisted
>> CAS;
>> - the data from the persisted CAS is loaded leniently, dropping all FSes
>> that are not defined in the pipeline CAS typesystem
>> 
>> Furthermore, the BinaryCasReader auto-detects the binary format and loads
>> it, be it the Java serialization-based format or one of the binary formats
>> that Marshall recently created, or our extended format that also embeds
>> the typesystem in the file.
>> 
>> Mind that depending on the use-case a different kind of serialization may
>> be appropriate.
>> 
>> For me, this covers in particular the following use-cases:
>> 
>> - fast (de)serialization of the entire CAS
>> - compact binary format (some more, some less)
>> - stable FS addresses (in some formats)
>> - restoring the pipeline CAS type system from file (i.e. CAS can be
>> initialized with an empty type system on creation and TS is set by reader -
>> in some formats)
>> - lenient loading of data allowing for different TSes on disk and in
>> pipeline (in some formats)
>> 
>> Would such an approach cover (some of your) use-cases?
>> 
> 
> 
> With the current design the best option is probably to store a type system
> id with the document.




Agreed, with the type system ID being a URI.
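
To illustrate how a URI-valued ID could avoid re-parsing the type system for
every CAS, here is a minimal sketch of a lookup cache. Everything in it is an
assumption for illustration: TypeSystemCache, the loader function, and the
example URN are made up and are not part of UIMA. The generic parameter T
stands in for whatever parsed type system object the application uses.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: resolves a type system by its URI and loads it at most once. */
final class TypeSystemCache<T> {
    private final Map<URI, T> cache = new ConcurrentHashMap<>();
    // Assumed to fetch and parse the type system that the ID refers to.
    private final Function<URI, T> loader;

    TypeSystemCache(Function<URI, T> loader) {
        this.loader = loader;
    }

    /** Returns the type system for this ID, invoking the loader only on first use. */
    T resolve(URI typeSystemId) {
        return cache.computeIfAbsent(typeSystemId, loader);
    }

    public static void main(String[] args) {
        int[] loads = {0}; // counts how often the (fake) parser runs
        TypeSystemCache<String> cache = new TypeSystemCache<String>(uri -> {
            loads[0]++;
            return "parsed:" + uri; // placeholder for a real parsed type system
        });
        URI id = URI.create("urn:example:typesystem:news-v1"); // made-up ID
        cache.resolve(id); // first document with this ID: loader runs
        cache.resolve(id); // later documents: served from the cache
        System.out.println("loader invocations: " + loads[0]);
    }
}
```

With a cache like this, each stored document carries only the short ID, and
the cost of parsing is paid once per distinct type system rather than once
per CAS.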


> It would be nice to avoid that additional complexity.
> 
> I think I have mainly two cases I can't really deal with:
> - A CAS contains FSes of many types. I know a few of those types and would
> like to only work with them. Not interested at all in the FSes with other
> types.
> - A CAS contains FSes of many types. I just want to deal with them as if
> they have a certain super-type. That could be FeatureStructure or
> AnnotationFS.
> 
> The CASes above have been produced by many different AAEs with similar, but
> slightly different type systems.

Right, those are typical issues that people will need to surmount,
particularly the second, where the super-type is some relatively generic type
(e.g., Token).
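
As a rough illustration of both cases, the sketch below keeps only the FSes
whose type is, or inherits from, a chosen super-type, and silently skips
types it does not know. It is a toy model, not UIMA code: SupertypeFilter
and the vendor.* and org.example.* type names are invented, types are
represented by plain names, and it assumes the differing type systems
actually share the super-type and that the hierarchy is acyclic. In UIMA
itself the analogous check is TypeSystem.subsumes(superType, subType).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a type system: maps each declared type name to its parent. */
final class SupertypeFilter {
    private final Map<String, String> parentOf = new HashMap<>();

    /** Declares that 'type' directly inherits from 'parent'. */
    void declare(String type, String parent) {
        parentOf.put(type, parent);
    }

    /** True if 'type' equals 'superType' or reaches it by walking up the parents. */
    boolean subsumes(String superType, String type) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            if (t.equals(superType)) {
                return true;
            }
        }
        return false; // undeclared types fall through here and are ignored
    }

    /** Keeps, in order, the type names covered by 'superType'. */
    List<String> select(String superType, List<String> fsTypes) {
        List<String> kept = new ArrayList<>();
        for (String type : fsTypes) {
            if (subsumes(superType, type)) {
                kept.add(type);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        SupertypeFilter ts = new SupertypeFilter();
        ts.declare("org.example.Token", "uima.tcas.Annotation");
        ts.declare("vendor.a.Token", "org.example.Token");
        ts.declare("vendor.b.Token", "org.example.Token");
        ts.declare("vendor.c.Sentence", "uima.tcas.Annotation");

        List<String> fsTypes = Arrays.asList(
            "vendor.a.Token", "vendor.c.Sentence", "vendor.b.Token", "vendor.x.Unknown");
        // Treat every token flavour as the generic Token; skip everything else.
        System.out.println(ts.select("org.example.Token", fsTypes));
    }
}
```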

..m
