Just to be sure it's well known:

The Javadoc for this class states that it generates only an "approximate"
representation of a CAS.

In particular, it says:

 * Generates an *approximate* inline XML representation of a CAS.
 * Annotation types are represented as XML tags, features are represented as
 * attributes.
 * 
 * Features whose values are FeatureStructures are not represented.
 * Feature values which are strings longer than 64 characters are truncated.
 * Feature values which are arrays of primitives are represented by
 * strings that look like [ xxx, xxx ]
 *
 * The Subject of analysis is presumed to be a text string.
 *
 * Some characters in the document's Subject-of-analysis
 * are replaced by blanks, because the characters aren't valid in xml documents.
 *
 * It doesn't work for annotations which are overlapping, because these cannot
 * be properly represented as properly - nested XML.
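
To make these caveats concrete, here is a rough, self-contained sketch of what
they amount to. It is not the CasToInlineXml source: the class and method names
below are hypothetical, the 64-character limit and the "[ xxx, xxx ]" shape are
taken only from the Javadoc text above, and the real class may differ in detail
(for example in how it marks a truncated value).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.StringJoiner;

// Hypothetical illustration of the Javadoc caveats; not UIMA code.
class InlineXmlCaveats {

    // Limit assumed from the Javadoc: strings longer than 64 chars are truncated.
    static final int MAX_LEN = 64;

    // Plain truncation to MAX_LEN characters (assumption: no marker appended).
    static String truncate(String value) {
        return value.length() <= MAX_LEN ? value : value.substring(0, MAX_LEN);
    }

    // Renders a primitive array in the "[ xxx, xxx ]" shape the Javadoc mentions.
    static String formatArray(int[] values) {
        StringJoiner joiner = new StringJoiner(", ", "[ ", " ]");
        for (int v : values) {
            joiner.add(Integer.toString(v));
        }
        return joiner.toString();
    }

    // Replaces characters that are not legal in XML 1.0 with a blank.
    // The string length is unchanged, so annotation offsets stay aligned.
    static String blankInvalidXmlChars(String text) {
        char[] out = text.toCharArray();
        for (int i = 0; i < out.length; i++) {
            char c = out[i];
            if (Character.isHighSurrogate(c) && i + 1 < out.length
                    && Character.isLowSurrogate(out[i + 1])) {
                i++; // well-formed surrogate pair: a valid supplementary character
                continue;
            }
            boolean valid = c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD);
            if (!valid) {
                out[i] = ' ';
            }
        }
        return new String(out);
    }

    // Returns true if the {begin, end} spans can be written as properly nested
    // XML elements, i.e. no two spans partially overlap.
    static boolean isProperlyNested(int[][] spans) {
        int[][] sorted = spans.clone();
        // Sort by begin ascending; for equal begins, the longer span comes first.
        Arrays.sort(sorted, (a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(b[1], a[1]));
        Deque<Integer> openEnds = new ArrayDeque<>();
        for (int[] span : sorted) {
            // Close every element that ends at or before this span begins.
            while (!openEnds.isEmpty() && openEnds.peek() <= span[0]) {
                openEnds.pop();
            }
            // A span that outlives its innermost open parent crosses it.
            if (!openEnds.isEmpty() && span[1] > openEnds.peek()) {
                return false;
            }
            openEnds.push(span[1]);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(formatArray(new int[] {1, 2, 3}));
        System.out.println(isProperlyNested(new int[][] {{0, 10}, {2, 5}}));
        System.out.println(isProperlyNested(new int[][] {{0, 5}, {3, 8}}));
    }
}
```

The last check is the important one for inline output: two annotations such as
0-5 and 3-8 cross each other, so no arrangement of open and close tags can
represent both.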

Given these "inaccuracies", are you sure you want to use this class in
your projects?

-Marshall

On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>      [ 
> https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
>
> Richard Eckart de Castilho updated UIMA-2101:
> ---------------------------------------------
>
>     Attachment: UIMA-2101-eckart-20110329.patch
>
> In addition to being able to disable formatting - as motivated by Steven - I 
> would like to be able to access the SAX events generated from the CAS, so I 
> can use a custom transformer in the DKPro Core component XmlWriterInline.
>
> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 of 
> the uimaj-core module.
>
> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, 
> ContentHandler) which allows the user to use a custom transformer or other 
> SAX event handler.
> - Added new property outputFormatted controlling whether generated XML 
> strings are formatted or not. This property does not affect the new 
> generateXML(...) method (see above). Per default the property is set to true, 
> resembling the state without the patch.
> - Added rudimentary test case to check if (not) formatting works. Code 
> borrows from XmiCasDeserializerTest.
> - Auto-formatted using UIMA Eclipse Code profile added a few braces.
>
>
>> CasToInlineXml adds whitespace
>> ------------------------------
>>
>>                 Key: UIMA-2101
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2101
>>             Project: UIMA
>>          Issue Type: Bug
>>    Affects Versions: 2.3.1SDK
>>            Reporter: Steven Bethard
>>         Attachments: UIMA-2101-eckart-20110329.patch
>>
>>
>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a 
>> single character document with a single annotation covering that one 
>> character, it will write:
>> {noformat}
>> <?xml version="1.0" encoding="UTF-8"?>
>> <Document>
>>     <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" 
>> language="x-unspecified">
>>         <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> 
>> </uima.tcas.Annotation>
>>     </uima.tcas.DocumentAnnotation>
>> </Document>
>> {noformat}
>> I think it should instead write everything in a single line, that is:
>> {noformat}
>> <?xml version="1.0" encoding="UTF-8"?>
>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" 
>> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" 
>> end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>> {noformat}
>> I believe this could be fixed by replacing the line:
>> {noformat}
>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>> {noformat}
>> with the line:
>> {noformat}
>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>> {noformat}
>> I think it's a bug that CasToInlineXml is changing the character offsets, 
>> but I would also be happy if there was an alternate constructor or a method 
>> on CasToInlineXml that allowed disabling the formatting.
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
