[jira] [Comment Edited] (UIMA-6266) Clean JSON Wire Format for CAS

Richard Eckart de Castilho (Jira) Tue, 13 Oct 2020 09:49:13 -0700


    [ https://issues.apache.org/jira/browse/UIMA-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213212#comment-17213212 ]


Richard Eckart de Castilho edited comment on UIMA-6266 at 10/13/20, 4:48 PM:
-----------------------------------------------------------------------------

Here are some thoughts:

* {{_type}} is IMHO metadata, but it is also a name which could appear verbatim 
as a field name in a Java class. I'd prefer to use a naming convention 
which does not potentially clash with Java (and maybe other languages) and 
which might also be used elsewhere in similar JSON formats. E.g. 
[JSON-LD|https://json-ld.org] prefixes such kinds of JSON fields with an {{@}}. 
* We need some kind of mapping from the {{type}} to the fully qualified type 
name (the one which includes the namespace / Java package). That is also 
necessary to handle cases where a type with the same base name exists in 
multiple namespaces / Java packages.
* For lat/long, we know that it is a floating point number, but we don't know 
which one. It would be good if the data format (somewhere outside the FS 
representation) had information on whether this is a float, a double or 
something else. I think there exist some conventions for encoding field type 
information in JSON (maybe like {{\{ "lat:f": 49.123, "lon:d": -84.234 \}}}), 
but I cannot find anything right now. In any case, having this information in 
some "schema" part of the file may be preferable to avoid unnecessary 
redundancy. There is an implicit question here of how much redundancy is 
acceptable. E.g. for a proper "wire" (network) format, having the feature names 
repeated in every FS might not be desirable - if the data could be represented 
as a JSON array, it could be much more compact (assuming that most fields have 
non-null/default values). But that makes the format more difficult to process. 
Where is the sweet spot between wire size and ease of access?
* The FSID here seems to be an actual feature of the feature structure (i.e. 
not "just" metadata used for references between FSes). We'd need an {{@id}} 
for FSes as well to allow making cross-references between them.
* The "spannedText" here also looks like a "true" feature - or is it part of 
the format? Normally, I would expect this to be inferable from the document 
text and the offsets per se. The question here might be whether the format 
should support partial transmission of the document data.
* This goes a bit beyond the simple FS discussion, but I think it might be 
worth grouping FSes by their types in the JSON format:

{noformat}
[{
  "@type": "my.world.Geo",
  "instances": [
    {"@id": 1, "begin":10, "end":12, "spannedText":"NY", "lat":40.7128, "lon":-74.006, "fsid":13},
    ...
  ]
}, {
  "@type": "my.world.Country",
  "instances": [
    {"@id": 100, "begin":5, "end":8, "spannedText":"USA", "fsid":162},
    ...
  ]
}]
{noformat}
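
To make the wire-size question from the lat/long bullet a bit more concrete, here is a minimal sketch in Python. It assumes a hypothetical {{schema}} entry per type that declares the feature names and their primitive types once, so that each instance can be sent as a positional JSON array. The key name {{schema}}, the {{int}}/{{double}} labels, the {{expand}} helper and the sample document text are all made up for illustration; none of them come from UIMA or from an existing JSON convention. The sketch also recovers "spannedText" from the document text and the offsets instead of transmitting it.

```python
import json

# Hypothetical compact layout: feature names and primitive types are declared
# once per type in a "schema" entry; each instance is a positional array.
compact = json.loads("""
[{
  "@type": "my.world.Geo",
  "schema": [["@id", "int"], ["begin", "int"], ["end", "int"],
             ["lat", "double"], ["lon", "double"], ["fsid", "int"]],
  "instances": [[1, 10, 12, 40.7128, -74.006, 13]]
}]
""")

# Made-up document text in which "NY" sits at offsets 10..12.
document_text = "I flew to NY yesterday."


def expand(groups, text):
    """Turn array-encoded instances back into one dict per feature structure."""
    result = []
    for group in groups:
        names = [name for name, _ in group["schema"]]
        for row in group["instances"]:
            fs = dict(zip(names, row))
            fs["@type"] = group["@type"]
            # spannedText is not transmitted; it is recovered from the offsets.
            fs["spannedText"] = text[fs["begin"]:fs["end"]]
            result.append(fs)
    return result


for fs in expand(compact, document_text):
    print(fs)
```

The array form saves the repeated feature names, but a consumer has to read the schema before it can interpret any instance, which is exactly the ease-of-access cost mentioned above.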


> Clean JSON Wire Format for CAS
> ------------------------------
>
>                 Key: UIMA-6266
>                 URL: https://issues.apache.org/jira/browse/UIMA-6266
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Core Java Framework
>            Reporter: Daniel Gruhl
>            Priority: Major
>
> A clean format for sending CAS over the wire in JSON would make 
> interoperation with other text analytics systems much easier. Impact on UIMAj 
> would be a need for the serializer and deserializer for these formats.
>  
> The hope would be this is NOT just a cut and paste of the XMI, but rather a 
> clean rethink of what would represent the best wire format going forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
