(uima-uimaj-io-jsoncas) 01/01: No issue: Comment out alternative sections, future work and clarify UTF-16 code units as the offset counting strategy

rec Thu, 22 Feb 2024 22:58:32 -0800

This is an automated email from the ASF dual-hosted git repository.

rec pushed a commit to branch no-issue-clean-up-spec
in repository https://gitbox.apache.org/repos/asf/uima-uimaj-io-jsoncas.git


commit 7334ae5bfe99cd7957b809eee33979c64ab34d42
Author: Richard Eckart de Castilho <[email protected]>
AuthorDate: Fri Feb 23 07:56:40 2024 +0100

    No issue: Comment out alternative sections, future work and clarify UTF-16 
code units as the offset counting strategy
---
 SPECIFICATION.adoc | 121 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 93 insertions(+), 28 deletions(-)

diff --git a/SPECIFICATION.adoc b/SPECIFICATION.adoc
index 8cee138..39c5d38 100644
--- a/SPECIFICATION.adoc
+++ b/SPECIFICATION.adoc
@@ -20,29 +20,9 @@
 
 = JSON serialization of the Apache UIMA CAS
 
-== Getting started
+This document defines a JSON-based serialization format for the UIMA CAS. The format aims to provide the go-to solution for encoding UIMA CAS data and to facilitate working with such data across platforms and programming languages.
 
-.Serializing a CAS to JSON
-[source,java]
-----
-import org.apache.uima.json.jsoncas2.JsonCas2Serializer
-
-CAS cas = ...;
-new JsonCas2Serializer().serialize(cas, new File("cas.json"));
-----
-
-.De-serializing a CAS from JSON
-[source,java]
-----
-import org.apache.uima.json.jsoncas2.JsonCas2Deserializer;
-
-CAS cas = ...; // The CAS must already be prepared with the type system used 
by the CAS JSON file
-new JsonCas2Deserializer().deserialize(new File("cas.json"), cas);
-----
-
-== Specification
-
-This document introduces a new JSON-based serialization format for the UIMA 
CAS. The new format aims to provide the new go-to solution for encoding UIMA 
CAS data and to facilitate working with such data cross-platform and 
cross-programming languages.
+== Motivation
 
 For the most part, the UIMA CAS 
XMIfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xmi]
 format has been the de-facto standard representation of UIMA data. However, 
the format has several short-comings:
 
@@ -75,8 +55,11 @@ The new UIMA JSON CAS format should meet the following 
requirements:
 * contain all information required to parse it
 * contain all information contained in the UIMA CAS
 * preserve all information across a (de)serialization cycle
-* avoid ambiguities footnote:[Note that this *draft* document will often 
propose
+* avoid ambiguities
+//// 
+footnote:[Note that this *draft* document will often propose
   alternative data representations. The idea is to consider them and to 
eventually argue for a canonical representation.]
+////  
 * maybe to show a comparable (or even a better) performance in terms of size 
and speed
 
 === UIMA CAS entities
@@ -134,11 +117,13 @@ Keys that have reserved names in the CAS JSON format 
always start with a KEYWORD
 
 Keyword fields must always precede user-definable fields in the serialized 
JSON objects. Additionally, there may be specific order requirements on the 
keyword fields themselves.
 
+////
 .Alternative suggestions:
 * The KEYWORD_MARKER should be `_` - however, `_` is a valid identifier 
character
 * The keys should not be upper-case but rather lower-case, camel-case, or 
kebab-case
 * The JSON structure should be defined such that user-defined and predefined 
keys are
   clearly separated from each other. Any object contains either only 
user-definable keys or only predefined keys. E.g. in a feature structure, there 
should be an explicit key `features` under which all user-definable features 
are located.
+////
 
 === CAS
 
@@ -165,11 +150,13 @@ To facilitate the implementation of streaming parsers, 
the fields should be enco
 . *Views:* provides information about the namespaces into which the feature 
structures 
   have been organized. In particular, the views section may provide 
information about the existence of a view even if that view has no member 
feature structures. Each view contains a list of members referring to feature 
structures from the previous section.
 
+////
 .Alternative suggestions:
 * The view section should contain an array pointing to the members of the 
view. The 
   views section should then precede the feature structures section such that 
the parser already knows to which view a feature structure should be added when 
it encounters the feature structure.
 * All three sections could in principle be optional. A UIMA JSON CAS 
containing only a 
   types section is essentially the equivalent of an XML type system 
description. A JSON CAS only containing feature structures could be sufficient 
if we assume that all these feature structures would be indexed by default in 
the default view. The views section would not be required if the CAS only 
contains the predefined default view.
+////
 
 === Header
 
@@ -181,8 +168,10 @@ The header provides information to the parser on how to 
parse the UIMA JSON CAS.
 |`%VERSION` |UIMA CAS JSON specification version to which the JSON document 
adheres |"1.0.0"
 |===
 
+////
 .Alternative suggestions:
 * Simply keep the header keys at the top-level without introducing a header 
section.
+////
 
 === Type System
 
@@ -196,15 +185,16 @@ This section encodes the type system definition. Every 
type can only be defined
 }
 ----
 
+////
 .Alternative suggestions:*
 * Instead of encoding only the essential type information, it could be 
considered to 
   permit extended type system information, in particular the ability to 
represent multiple type systems along with version information, vendor 
information, documentation, etc.
 * Allow importing type systems through a reference to a URL/URI.
+////
 
+==== Types
 
-==== Type descriptions
-
-UIMA type descriptions are described in the Apache UIMA Java SDK reference 
documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system]
 and we largely follow that specification. According to that specification, a 
type description consists of:
+UIMA types are described in the Apache UIMA Java SDK reference 
documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system]
 and we largely follow that specification. According to that specification, a 
type description consists of:
 
 *  *Type name:* identifier of the type in a `<namespace>.<name>` notation.
 * *Description (optional):* documentation for the type
@@ -223,9 +213,9 @@ UIMA type descriptions are described in the Apache UIMA 
Java SDK reference docum
 }
 ----
 
-==== Feature descriptions
+==== Features
 
-Similarly, UIMA feature descriptions are described in the Apache UIMA Java SDK 
reference 
documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system]
 as consisting of:
+Similarly, UIMA features are described in the Apache UIMA Java SDK reference 
documentationfootnote:[https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.type_system]
 as consisting of:
 
 * *Feature name:* the identifier of the feature
 * *Description (optional):* documentation for the feature
@@ -248,6 +238,8 @@ Similarly, UIMA feature descriptions are described in the 
Apache UIMA Java SDK r
 
 For simplicity, the UIMA JSON CAS format ignores the *Multiple references 
allowed* flag and always represents arrays as separate feature structures.
 
+////
+
 .Alternative suggestions:
 * Instead of using the full type name as the key in the type system JSON 
object, an ID 
   or an abbreviated type name could be used. That could significantly reduce 
the JSON CAS size if the type field of the feature structures referred to the 
short name/ID. Similarly for the features.
@@ -258,6 +250,8 @@ For simplicity, the UIMA JSON CAS format ignores the 
*Multiple references allowe
 * UIMAv3 has started using reified array types and introduced a new writing 
convention 
   for them using `[]` as a suffix: `uima.tcas.Annotation[]`, 
`uima.cas.Integer[]`. So we could consider abandoning the concept of an array 
element type in the type system section of the CAS JSON format and simply use 
the `<type>[]` convention to represent arrays of a given type. That would make 
the type system section more compact because we can entirely omit the 
`%ELEMENT_TYPE` key. The `%ELEMENT_TYPE` could still be required for other 
"generic" container types such as FSList unless we  [...]
 
+////
+
 .Notes:
 * The Apache UIMA Java SDK does currently discard the type and feature 
descriptions when 
   creating a `TypeSystemImpl` instance. Thus, the descriptions are generally 
lost when a type system is recovered from the CAS for serialization. To meet 
the requirement that no information is lost, the Apache UIMA Java SDK 
implementation would need to be extended to allow preserving the descriptions.
@@ -275,6 +269,7 @@ The feature structures section contains the actual feature 
structures. The secti
 ]
 ----
 
+////
 .Alternative suggestions:
 * It could be implemented as a JSON map using the feature structure ID as its 
key and 
   the feature structure as values.
@@ -293,6 +288,7 @@ The feature structures section contains the actual feature 
structures. The secti
 |We can more "naturally" define a reduced form of the UIMA JSON CAS which 
consists only of the feature structure array. A parser can easily distinguish 
between a full JSON CAS and the reduced form by checking if the first JSON 
token is an array-start or an object-start token. 
 |
 |===
+////
 
 ==== Feature structure representation
 
@@ -408,6 +404,10 @@ In general, the go-to standard for characters is the 
Unicode standardfootnote:[h
 
 To identify features whose values may need a conversion during 
(de)serialization, the anchor marker `^` was introduced (cf. section on "Anchor 
features" above).
 
+Character offsets used in the JSON format are expected to be based on *UTF-16 code units*. UTF-16 code units are the native character offset base in languages such as Java or JavaScript. Future versions of the specification may define a metadata key to be included in the JSON file that could be used to indicate a different base. Implementations in languages that use a different native character counting (e.g. Python) need to convert from/to UTF-16 code unit offsets when reading/writing the JSON CAS [...]
+
+////
+
 *_Note: the draft specification currently does not prefer any particular 
encoding scheme. Please refer to the alternative suggestions below and provide 
feedback._*
 
 *Alternative suggestions:*
@@ -425,6 +425,10 @@ To identify features whose values may need a conversion 
during (de)serialization
 * There is a header key in the CAS which specifies which anchor encoding is 
being used 
   (i.e. UTF-8, UTF-16, UTF-32/codepoints or grapheme clusters - the latter 
possibly along with a Unicode version number and possible with some closer 
description of which Unicode library and version of that library was being 
used). If the header is absent, a default encoding is prescribed by UIMA JSON 
CAS.
 
+////
+
+////
+
 == Future(!) directions
 
 This draft specification of the UIMA JSON CAS format tries to iron out the 
most basic aspects of the format. However, there are additional considerations 
on the radar which may or may not have influence on the format, even on the 
basics discussed here.
@@ -497,3 +501,64 @@ Each view has one. Theoretically there could be more than 
one, but only one is *
 .See also
 * 
https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/tcas/DocumentAnnotation.html[+++https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/tcas/DocumentAnnotation.html+++]
 * 
link:++https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/JCas.html#getDocumentAnnotationFs--++[+++https://uima.apache.org/d/uimaj-current/apidocs/org/apache/uima/jcas/JCas.html#getDocumentAnnotationFs--+++]
+
+////
+
+== Implementations
+
+=== Java
+
+The Java implementation of the JSON CAS format is currently provided by the 
Apache UIMA project.
+
+.Maven dependency
+[source,xml]
+----
+<dependency>
+  <groupId>org.apache.uima</groupId>
+  <artifactId>uimaj-io-json</artifactId>
+  <version>[USE LATEST VERSION]</version>
+</dependency>
+----
+
+.Writing a JSON CAS file
+[source,java]
+----
+import org.apache.uima.json.jsoncas2.JsonCas2Serializer;
+
+CAS cas = ...;
+new JsonCas2Serializer().serialize(cas, new File("cas.json"));
+----
+
+.Reading a JSON CAS file
+[source,java]
+----
+import org.apache.uima.json.jsoncas2.JsonCas2Deserializer;
+
+CAS cas = ...; // The CAS must already be prepared with the type system used by the CAS JSON file
+new JsonCas2Deserializer().deserialize(new File("cas.json"), cas);
+----
+
+=== Python
+
+The Python implementation of the JSON CAS format is currently available in 
link:https://github.com/dkpro/dkpro-cassis[DKPro Cassis]. This is a third-party 
(non-ASF) library provided under the Apache License 2.0.
+
+.Installing DKPro Cassis
+[source,sh]
+----
+pip install dkpro-cassis
+----
+
+.Reading a JSON CAS file
+[source,python]
+----
+from cassis import *
+
+with open('cas.json', 'rb') as f:
+    cas = load_cas_from_json(f)
+----
+
+.Writing a JSON CAS file
+[source,python]
+----
+cas.to_json("my_cas.json")
+----
\ No newline at end of file
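
Note on the UTF-16 code unit paragraph added in this commit: the conversion it asks of implementations with a different native character counting can be sketched in Python roughly as follows. This is only an illustration under stated assumptions. The two helper names are hypothetical and are not part of DKPro Cassis or of the specification, and the text is assumed to contain no lone surrogates.

```python
def codepoint_to_utf16_offset(text: str, offset: int) -> int:
    """Convert a Python (code point) offset into a UTF-16 code unit offset.

    Hypothetical helper for illustration only.
    """
    # Every UTF-16 code unit is 2 bytes; code points above U+FFFF are encoded
    # as a surrogate pair and therefore contribute two code units.
    return len(text[:offset].encode("utf-16-le")) // 2


def utf16_to_codepoint_offset(text: str, offset: int) -> int:
    """Convert a UTF-16 code unit offset into a Python (code point) offset.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError if
    the offset falls in the middle of a surrogate pair.
    """
    encoded = text.encode("utf-16-le")
    return len(encoded[: offset * 2].decode("utf-16-le"))


if __name__ == "__main__":
    text = "a\U0001F600b"  # 'a', an emoji outside the BMP, 'b'
    # The emoji occupies one code point but two UTF-16 code units.
    print(codepoint_to_utf16_offset(text, 2))  # 3
    print(utf16_to_codepoint_offset(text, 3))  # 2
```

Re-encoding the prefix on every call is linear in the text length, so a reader or writer handling many annotations would probably want to precompute an offset table once per document instead.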
