[jira] [Updated] (ODFTOOLKIT-400) Unable to obtain the charset encoding of an odt document

Nimarukan (JIRA) Sun, 09 Aug 2015 06:54:05 -0700

     [ 
https://issues.apache.org/jira/browse/ODFTOOLKIT-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nimarukan updated ODFTOOLKIT-400:
---------------------------------
    Attachment: 400-part3-main-OdfFileDom_initXmlDecl.patch
                400-part2-test-OdfFileDom_xmlDeclTest.patch
                400-part1-pom_xml-FromJava1_5To1_6ForStAX.patch

Diagnosis: No XML declaration fields of the DOM document are currently set 
because the file is parsed with a SAX parser, and SAX does not reveal the XML 
declaration to SAX handlers (org.xml.sax).

Approach: Parse the beginning bytes with a StAX parser (javax.xml.stream), 
which is included in Java 6 and later.
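
For illustration only (this is not the code in the attached patch), a minimal 
sketch of reading the XML declaration with StAX might look like the following. 
The class name XmlDeclPeek and the method readXmlDecl are hypothetical; only 
the javax.xml.stream calls are standard API.

```java
import java.io.ByteArrayInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of odfdom or the attached patches.
public class XmlDeclPeek {

    /**
     * Returns {version, declared encoding, standalone}.
     * An entry may be null when the declaration omits that field.
     */
    public static String[] readXmlDecl(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            // A freshly created reader is positioned at START_DOCUMENT,
            // which is where the declaration fields can be queried.
            String version = reader.getVersion();
            String encoding = reader.getCharacterEncodingScheme();
            String standalone = reader.standaloneSet()
                    ? String.valueOf(reader.isStandalone())
                    : null;
            return new String[] { version, encoding, standalone };
        } finally {
            // Closed without ever calling next(), so the body is not parsed.
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<root/>").getBytes("UTF-8");
        String[] decl = readXmlDecl(xml);
        System.out.println(decl[0] + " " + decl[1] + " " + decl[2]);
    }
}
```

getCharacterEncodingScheme() reports the encoding written in the declaration, 
while getEncoding() reports the input encoding if the parser knows it; the 
sketch uses the former because the issue is about the declared value.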

Attached are odfdom patches for
  - the pom.xml java version change,
  - the test case, and
  - the fix.

- POM: The source and target JDK versions are increased from JDK 1.5 to JDK 1.6 
so that StAX (javax.xml.stream) will be available.  The test case also uses 
java.nio.charset.Charset in a way that needs Java 6.

- Test: The test case includes tests for the XML declaration fields:
xmlVersion, xmlEncoding, and xmlStandalone.

- Fix: Change OdfFileDom.initialize() to use a StAX parser to read the XML 
declaration, and initialize the XML declaration fields.

(The XML declaration is parsed during initialization and not later because 
after the DOM is created, bytes are generated from the DOM, not the original 
file.  For low overhead, the same internal-document byte array is used for both 
the StAX parser and SAX parser input streams.  The StAX parser is closed 
immediately after the XML declaration fields are extracted and it does not read 
the rest of the stream.)
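
As a rough sketch of the buffer sharing described above (again, not the patch 
itself), two ByteArrayInputStream instances can wrap the same byte array, one 
per parser, without copying it. The class name SharedBufferParse and the 
returned pair are assumptions made for the example; the real fix works on 
OdfFileDom's internal document bytes.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not part of odfdom or the attached patches.
public class SharedBufferParse {

    /** Returns {declared encoding, name of the root element}. */
    public static String[] parse(byte[] xml) throws Exception {
        // Pass 1: StAX looks only at the XML declaration and is closed at once.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        String encoding;
        try {
            encoding = reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }

        // Pass 2: SAX parses the whole document from a second stream that
        // wraps the same array (ByteArrayInputStream does not copy it).
        final String[] root = new String[1];
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (root[0] == null) {
                            root[0] = qName; // remember the first element seen
                        }
                    }
                });

        return new String[] { encoding, root[0] };
    }
}
```

Each stream keeps its own read position, so the early close of the StAX 
reader has no effect on what the SAX parser sees.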

To apply each patch:

    patch -p 1 -i 400-partN-xxx.patch

(note: OdfFileDom.java currently has a mix of '\n' and '\r\n' line terminators.)


> Unable to obtain the charset encoding of an odt document
> --------------------------------------------------------
>
>                 Key: ODFTOOLKIT-400
>                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-400
>             Project: ODF Toolkit
>          Issue Type: Bug
>          Components: odfdom
>         Environment: linux - ubuntu 14.04
>            Reporter: Joshua
>         Attachments: 400-part1-pom_xml-FromJava1_5To1_6ForStAX.patch, 
> 400-part2-test-OdfFileDom_xmlDeclTest.patch, 
> 400-part3-main-OdfFileDom_initXmlDecl.patch, testOdt.odt
>
>
> I'm trying to convert odt to html. In doing the conversion I'm trying to 
> obtain the charset encoding of the odt document so that I can set the 
> appropriate value on the html end. However, I always get a 'null' value when 
> trying to read the charset.
> {code}
>         OdfTextDocument odfDoc = OdfTextDocument.loadDocument(is);
>         System.out.println(odfDoc.getContentDom().getXmlEncoding());
> {code}
> For the test document attached I am expecting to get UTF-8 but always see 
> 'null'. This happens on other docs as well.
> Is there a better way to obtain the charset encoding of an odt document?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
