Unification of HTML output from Office, OOXML and Open Document parsers
-----------------------------------------------------------------------

                 Key: TIKA-524
                 URL: https://issues.apache.org/jira/browse/TIKA-524
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.7
            Reporter: Geoff Jarrad
            Priority: Minor
             Fix For: 0.8


Word-processing documents can easily be converted, apparently without loss of
information, between common storage formats such as .doc, .docx and .odt.
However, when these variants of a single document are analysed with the
respective Tika parsers (OfficeParser, OOXMLParser and OpenDocumentParser),
the resulting HTML output varies considerably from parser to parser.
Given the latest advances in these parsers, it should now be feasible to (i)
establish a common HTML representation that can adequately describe
word-processing document content, and (ii) modify the aforementioned parsers
to conform to this new standard.

Points of interest include headings, pre-formatted text and other styles,
headers and footers, tables and hyperlinks.
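
One way to make the divergence between parsers measurable is to reduce each
parser's XHTML output to its sequence of element names and compare the
sequences. The following is only a minimal sketch using the standard Java SAX
API. The class name XhtmlOutline, the outline method and the two sample
strings are hypothetical illustrations, not actual Tika output or Tika API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
public class XhtmlOutline {

    /** Returns the element names of a well-formed XHTML string, in document order. */
    public static List<String> outline(String xhtml) throws Exception {
        final List<String> names = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        names.add(localName);
                    }
                });
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Invented samples: the same heading rendered as <h1> by one parser
        // and as a plain <p> by another.
        String first = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Title</h1><p>Text</p></body></html>";
        String second = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>Title</p><p>Text</p></body></html>";

        System.out.println(outline(first));
        System.out.println(outline(second));
        System.out.println("same structure: " + outline(first).equals(outline(second)));
    }
}
```

In practice the two strings would be the serialised output of two of the
parsers for the same source document. A common HTML representation would make
the two outlines identical.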

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
