[ https://issues.apache.org/jira/browse/UIMA-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213212#comment-17213212 ]
Richard Eckart de Castilho edited comment on UIMA-6266 at 10/13/20, 4:48 PM:
-----------------------------------------------------------------------------

Here are some thoughts:

* {{_type}} is IMHO metadata, but it is also a name that could appear verbatim as a field name in a Java class. I'd prefer a naming convention that does not potentially clash with Java (and maybe other languages) and that might also be used elsewhere in similar JSON formats. E.g. [JSON-LD|https://json-ld.org] prefixes such kinds of JSON fields with an {{@}}.
* We need some kind of mapping from the {{type}} to the fully qualified type name (the one which includes the namespace / Java package). That is also necessary to handle cases where a type with the same base name exists in multiple namespaces / Java packages.
* For lat/long, we know that it is a floating point number, but we don't know which kind. It would be good if the data format (somewhere outside the FS representation) had information on whether this is a float, a double or something else. I think there exist some conventions for encoding field type information in JSON (maybe like {{\{ "lat:f": 49.123, "lon:d": -84.234 \}}}), but I cannot find anything right now. In any case, having this information in some "schema" part of the file may be preferable to avoid unnecessary redundancy. There is an implicit question here of how much redundancy is acceptable. E.g. for a proper "wire" (network) format, having the feature names repeated in every FS might not be desirable - if the data could be represented as a JSON array, it could be much more compact (assuming that most fields have non-null/default values). But that makes the format more difficult to process. Where is the sweet spot between wire size and ease of access?
* The FSID here seems to be an actual feature of the feature structure (i.e. not "just" metadata used for references between FSes). We'd need an {{@id}} for FSes as well to allow making cross-references between them.
* The "spannedText" here also looks like a "true" feature - or is it part of the format? Normally, I would expect it to be inferable from the document text and the offsets. The question here might be whether the format should support partial transmission of the document data.
* This goes a bit beyond the simple FS discussion, but I think it might be worth grouping FSes by their types in the JSON format:

{noformat}
[{
  "@type": "my.world.Geo",
  "instances": [
    {"@id": 1, "begin":10, "end":12, "spannedText":"NY", "lat":40.7128, "lon":-74.006, "fsid":13},
    ...
  ]
}, {
  "@type": "my.world.Country",
  "instances": [
    {"@id": 100, "begin":5, "end":8, "spannedText":"USA", "fsid": "162"},
    ...
  ]
}]
{noformat}

> Clean JSON Wire Format for CAS
> ------------------------------
>
>                 Key: UIMA-6266
>                 URL: https://issues.apache.org/jira/browse/UIMA-6266
>             Project: UIMA
>          Issue Type: New Feature
>          Components: Core Java Framework
>            Reporter: Daniel Gruhl
>            Priority: Major
>
> A clean format for sending CAS over the wire in JSON would make
> interoperation with other text analytics systems much easier. Impact on UIMAj
> would be a need for the serializer and deserializer for these formats.
>
> The hope would be this is NOT just a cut and paste of the XMI, but rather a
> clean rethink of what would represent the best wire format going forward.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
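
As a rough illustration of the "schema" and JSON array ideas raised in the comment above, here is a minimal Python sketch. It is not a proposed UIMA format: the keys {{text}}, {{schema}} and {{instances}}, the type tags {{int}} and {{double}}, the sample sentence with its offsets, and the function {{expand}} are all assumptions made up for this example. It states feature names and primitive types once per type, sends each FS as a plain array, and derives "spannedText" from the document text and the offsets on the receiving side.

```python
import json

# Hypothetical compact layout (all key names are assumptions): the "schema"
# section states each feature name and primitive type once per type, and each
# instance is a plain array in schema order.
compact = {
    "text": "In the USA, NY is big.",
    "schema": {
        "my.world.Geo": [["@id", "int"], ["begin", "int"], ["end", "int"],
                         ["lat", "double"], ["lon", "double"]],
        "my.world.Country": [["@id", "int"], ["begin", "int"], ["end", "int"]],
    },
    "instances": {
        "my.world.Geo": [[1, 12, 14, 40.7128, -74.006]],
        "my.world.Country": [[100, 7, 10]],
    },
}


def expand(doc):
    """Expand array-encoded instances into one dict per feature structure."""
    expanded = []
    for type_name, rows in doc["instances"].items():
        # The type tag is ignored here; a typed consumer (e.g. Java) could
        # use it to choose between float and double.
        names = [name for name, _type_tag in doc["schema"][type_name]]
        for row in rows:
            fs = dict(zip(names, row))
            fs["@type"] = type_name
            # Derived from text and offsets instead of being transmitted.
            fs["spannedText"] = doc["text"][fs["begin"]:fs["end"]]
            expanded.append(fs)
    return expanded


feature_structures = expand(compact)
# Index by "@id" so that references between FSes could be resolved.
by_id = {fs["@id"]: fs for fs in feature_structures}

# Print the serialized size of the instance data in both encodings.
print(len(json.dumps(compact["instances"])), len(json.dumps(feature_structures)))
```

The trade-off is the one the comment asks about: the array form needs the schema to be readable at all, while the expanded form is self-describing but repeats every feature name in every FS.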