This is an automated email from the ASF dual-hosted git repository. rec pushed a commit to branch no-issue-clean-up-spec in repository https://gitbox.apache.org/repos/asf/uima-uimaj-io-jsoncas.git
commit 7334ae5bfe99cd7957b809eee33979c64ab34d42 Author: Richard Eckart de Castilho <[email protected]> AuthorDate: Fri Feb 23 07:56:40 2024 +0100 No issue: Comment out alternative sections, future work and clarify UTF-16 code units as the offset counting strategy --- SPECIFICATION.adoc | 121 ++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 93 insertions(+), 28 deletions(-) diff --git a/SPECIFICATION.adoc b/SPECIFICATION.adoc index 8cee138..39c5d38 100644 --- a/SPECIFICATION.adoc +++ b/SPECIFICATION.adoc @@ -20,29 +20,9 @@ = JSON serialization of the Apache UIMA CAS -== Getting started +This document defines a JSON-based serialization format for the UIMA CAS. This format provides the new go-to solution for encoding UIMA CAS data and facilitates working with such data across platforms and programming languages. -.Serializing a CAS to JSON -[source,java] ----- -import org.apache.uima.json.jsoncas2.JsonCas2Serializer - -CAS cas = ...; -new JsonCas2Serializer().serialize(cas, new File("cas.json")); ----- - -.De-serializing a CAS from JSON -[source,java] ----- -import org.apache.uima.json.jsoncas2.JsonCas2Deserializer; - -CAS cas = ...; // The CAS must already be prepared with the type system used by the CAS JSON file -new JsonCas2Deserializer().deserialize(new File("cas.json"), cas); ----- - -== Specification - -This document introduces a new JSON-based serialization format for the UIMA CAS. The new format aims to provide the new go-to solution for encoding UIMA CAS data and to facilitate working with such data cross-platform and cross-programming languages. +== Motivation For the most part, the UIMA CAS XMIfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xmi] format has been the de-facto standard representation of UIMA data. 
However, the format has several short-comings: @@ -75,8 +55,11 @@ The new UIMA JSON CAS format should meet the following requirements: * contain all information required to parse it * contain all information contained in the UIMA CAS * preserve all information across a (de)serialization cycle -* avoid ambiguities footnote:[Note that this *draft* document will often propose +* avoid ambiguities +//// +footnote:[Note that this *draft* document will often propose alternative data representations. The idea is to consider them and to eventually argue for a canonical representation.] +//// * maybe to show a comparable (or even a better) performance in terms of size and speed === UIMA CAS entities @@ -134,11 +117,13 @@ Keys that have reserved names in the CAS JSON format always start with a KEYWORD Keyword fields must always precede user-definable fields in the serialized JSON objects. Additionally, there may be specific order requirements on the keyword fields themselves. +//// .Alternative suggestions: * The KEYWORD_MARKER should be `_` - however, `_` is a valid identifier character * The keys should not be upper-case but rather lower-case, camel-case, or kebab-case * The JSON structure should be defined such that user-defined and predefined keys are clearly separated from each other. Any object contains either only user-definable keys or only predefined keys. E.g. in a feature structure, there should be an explicit key `features` under which all user-definable features are located. +//// === CAS @@ -165,11 +150,13 @@ To facilitate the implementation of streaming parsers, the fields should be enco . *Views:* provides information about the namespaces into which the feature structures have been organized. In particular, the views section may provide information about the existence of a view even if that view has no member feature structures. Each view contains a list of members referring to feature structures from the previous section. 
+//// .Alternative suggestions: * The view section should contain an array pointing to the members of the view. The views section should then precede the feature structures section such that the parser already knows to which view a feature structure should be added when it encounters the feature structure. * All three sections could in principle be optional. A UIMA JSON CAS containing only a types section is essentially the equivalent of an XML type system description. A JSON CAS only containing feature structures could be sufficient if we assume that all these feature structures would be indexed by default in the default view. The views section would not be required if the CAS only contains the predefined default view. +//// === Header @@ -181,8 +168,10 @@ The header provides information to the parser on how to parse the UIMA JSON CAS. |`%VERSION` |UIMA CAS JSON specification version to which the JSON document adheres |"1.0.0" |=== +//// .Alternative suggestions: * Simply keep the header keys at the top-level without introducing a header section. +//// === Type System @@ -196,15 +185,16 @@ This section encodes the type system definition. Every type can only be defined } ---- +//// .Alternative suggestions:* * Instead of encoding only the essential type information, it could be considered to permit extended type system information, in particular the ability to represent multiple type systems along with version information, vendor information, documentation, etc. * Allow importing type systems through a reference to a URL/URI. +//// +==== Types -==== Type descriptions - -UIMA type descriptions are described in the Apache UIMA Java SDK reference documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system] and we largely follow that specification. 
According to that specification, a type description consists of: +UIMA types are described in the Apache UIMA Java SDK reference documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system] and we largely follow that specification. According to that specification, a type description consists of: * *Type name:* identifier of the type in a `<namespace>.<name>` notation. * *Description (optional):* documentation for the type @@ -223,9 +213,9 @@ UIMA type descriptions are described in the Apache UIMA Java SDK reference docum } ---- -==== Feature descriptions +==== Features -Similarly, UIMA feature descriptions are described in the Apache UIMA Java SDK reference documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system] as consisting of: +Similarly, UIMA features are described in the Apache UIMA Java SDK reference documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system] as consisting of: * *Feature name:* the identifier of the feature * *Description (optional):* documentation for the feature @@ -248,6 +238,8 @@ Similarly, UIMA feature descriptions are described in the Apache UIMA Java SDK r For simplicity, the UIMA JSON CAS format ignores the *Multiple references allowed* flag and always represents arrays as separate feature structures. +//// + .Alternative suggestions: * Instead of using the full type name as the key in the type system JSON object, an ID or an abbreviated type name could be used. That could significantly reduce the JSON CAS size if the type field of the feature structures referred to the short name/ID. Similarly for the features. 
@@ -258,6 +250,8 @@ For simplicity, the UIMA JSON CAS format ignores the *Multiple references allowe * UIMAv3 has started using reified array types and introduced a new writing convention for them using `[]` as a suffix: `uima.tcas.Annotation[]`, `uima.cas.Integer[]`. So we could consider abandoning the concept of an array element type in the type system section of the CAS JSON format and simply use the `<type>[]` convention to represent arrays of a given type. That would make the type system section more compact because we can entirely omit the `%ELEMENT_TYPE` key. The `%ELEMENT_TYPE` could still be required for other "generic" container types such as FSList unless we [...] +//// + .Notes: * The Apache UIMA Java SDK does currently discard the type and feature descriptions when creating a `TypeSystemImpl` instance. Thus, the descriptions are generally lost when a type system is recovered from the CAS for serialization. To meet the requirement that no information is lost, the Apache UIMA Java SDK implementation would need to be extended to allow preserving the descriptions. @@ -275,6 +269,7 @@ The feature structures section contains the actual feature structures. The secti ] ---- +//// .Alternative suggestions: * It could be implemented as a JSON map using the feature structure ID as its key and the feature structure as values. @@ -293,6 +288,7 @@ The feature structures section contains the actual feature structures. The secti |We can more "naturally" define a reduced form of the UIMA JSON CAS which consists only of the feature structure array. A parser can easily distinguish between a full JSON CAS and the reduced form by checking if the first JSON token is an array-start or an object-start token. 
| |=== +//// ==== Feature structure representation @@ -408,6 +404,10 @@ In general, the go-to standard for characters is the Unicode standardfootnote:[h To identify features whose values may need a conversion during (de)serialization, the anchor marker `^` was introduced (cf. section on "Anchor features" above). +Character offsets used in the JSON format are expected to be based on *UTF-16 code units*. Future versions of the specification may define a metadata key to be included in the JSON file that could be used to indicate a different base. UTF-16 code units are the native character offset base in languages such as Java or JavaScript. Implementations in languages that use a different native character counting (e.g. Python) need to convert from/to UTF-16 code unit offsets when reading/writing the JSON CAS [...] + +//// + *_Note: the draft specification currently does not prefer any particular encoding scheme. Please refer to the alternative suggestions below and provide feedback._* *Alternative suggestions:* @@ -425,6 +425,10 @@ To identify features whose values may need a conversion during (de)serialization * There is a header key in the CAS which specifies which anchor encoding is being used (i.e. UTF-8, UTF-16, UTF-32/codepoints or grapheme clusters - the latter possibly along with a Unicode version number and possible with some closer description of which Unicode library and version of that library was being used). If the header is absent, a default encoding is prescribed by UIMA JSON CAS. +//// + +//// + == Future(!) directions This draft specification of the UIMA JSON CAS format tries to iron out the most basic aspects of the format. However, there are additional considerations on the radar which may or may not have influence on the format, even on the basics discussed here. @@ -497,3 +501,64 @@ Each view has one. 
Theoretically there could be more than one, but only one is * .See also * https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/tcas/DocumentAnnotation.html[+++https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/tcas/DocumentAnnotation.html+++] * link:++https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/JCas.html#getDocumentAnnotationFs--++[+++https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/JCas.html#getDocumentAnnotationFs--+++] + +//// + +== Implementations + +=== Java + +The Java implementation of the JSON CAS format is currently provided by the Apache UIMA project. + +.Maven dependency +[source,xml] +---- +<dependency> + <groupId>org.apache.uima</groupId> + <artifactId>uimaj-io-json</artifactId> + <version>[USE LATEST VERSION]</version> +</dependency> +---- + +.Writing a JSON CAS file +[source,java] +---- +import org.apache.uima.json.jsoncas2.JsonCas2Serializer; + +CAS cas = ...; +new JsonCas2Serializer().serialize(cas, new File("cas.json")); +---- + +.Reading a JSON CAS file +[source,java] +---- +import org.apache.uima.json.jsoncas2.JsonCas2Deserializer; + +CAS cas = ...; // The CAS must already be prepared with the type system used by the CAS JSON file +new JsonCas2Deserializer().deserialize(new File("cas.json"), cas); +---- + +=== Python + +The Python implementation of the JSON CAS format is currently available in link:https://github.com/dkpro/dkpro-cassis[DKPro Cassis]. This is a third-party (non-ASF) library provided under the Apache License 2.0. + +.Installing DKPro Cassis +[source,sh] +---- +pip install dkpro-cassis +---- + +.Reading a JSON CAS file +[source,python] +---- +from cassis import * + +with open('cas.json', 'rb') as f: + cas = load_cas_from_json(f) +---- + +.Writing a JSON CAS file +[source,python] +---- +cas.to_json("my_cas.json") +---- \ No newline at end of file
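
Note on the UTF-16 offset change in this commit: the new paragraph says that implementations in languages with a different native character counting (e.g. Python) need to convert from/to UTF-16 code unit offsets. The following is a minimal sketch of what such a conversion could look like in Python, where string indices count Unicode code points. The function names `codepoint_to_utf16_offset` and `utf16_to_codepoint_offset` are hypothetical and are not taken from the specification, the Apache UIMA Java SDK, or DKPro Cassis.

```python
def codepoint_to_utf16_offset(text: str, offset: int) -> int:
    """Convert a Python string index (code points) to a UTF-16 code unit offset."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    # Characters outside the Basic Multilingual Plane need two UTF-16
    # code units (a surrogate pair), so each one shifts the offset by one.
    return offset + sum(1 for ch in text[:offset] if ord(ch) > 0xFFFF)


def utf16_to_codepoint_offset(text: str, offset: int) -> int:
    """Convert a UTF-16 code unit offset to a Python string index (code points)."""
    if offset < 0:
        raise ValueError("offset out of range")
    units = 0
    for index, ch in enumerate(text):
        if units == offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > offset:
            # The offset points between the two halves of a surrogate pair.
            raise ValueError("offset falls inside a surrogate pair")
    if units == offset:
        return len(text)
    raise ValueError("offset out of range")


if __name__ == "__main__":
    sample = "a\U0001F600b"  # 'a', an emoji outside the BMP, 'b'
    # Python index of 'b' versus its UTF-16 code unit offset.
    print(codepoint_to_utf16_offset(sample, 2))
    print(utf16_to_codepoint_offset(sample, 3))
```

A reader would apply the first function to `begin`/`end` style offsets when writing a JSON CAS and the second when reading one. The linear scan is chosen for clarity; for large documents with many annotations, a precomputed offset table would likely be a better fit.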
