Unification of HTML output from Office, OOXML and Open Document parsers
-----------------------------------------------------------------------
Key: TIKA-524
URL: https://issues.apache.org/jira/browse/TIKA-524
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 0.7
Reporter: Geoff Jarrad
Priority: Minor
Fix For: 0.8
Word-processing documents can usually be converted between common storage
formats, such as .doc, .docx and .odt, with little apparent loss of
information. However, when such variants of a single document are parsed with
the respective Tika parsers (OfficeParser, OOXMLParser and OpenDocumentParser),
the resulting HTML output varies considerably between parsers.
Given the recent advances in these parsers, it should now be feasible to: (i)
establish a common HTML representation that can adequately describe
word-processing document content, and (ii) modify the aforementioned parsers
to conform to this common representation.
Points of interest include headings, pre-formatted text and other styles,
headers and footers, tables and hyperlinks.
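As a rough illustration of how the variation between parsers could be checked,
the sketch below reduces a well-formed XHTML string to its sequence of element
names, so that the output for two variants of the same document can be
compared structurally. It uses only the JDK's SAX API and does not call Tika.
The class name XhtmlSkeleton, the method skeleton and the two sample strings
are assumptions made up for this example; the samples are not real parser
output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
class XhtmlSkeleton {

    /** Returns the element names of a well-formed XHTML string in document order. */
    static List<String> skeleton(String xhtml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness is needed for localName to be populated.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        names.add(localName);
                    }
                });
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented samples for illustration only, NOT actual Tika output.
        String first = "<html><body><h1>Title</h1><p>Text</p></body></html>";
        String second = "<html><body><p class=\"Heading1\">Title</p>"
                      + "<p>Text</p></body></html>";

        System.out.println(skeleton(first));   // [html, body, h1, p]
        System.out.println(skeleton(second));  // [html, body, p, p]
        System.out.println(skeleton(first).equals(skeleton(second)));  // false
    }
}
```

A comparison like this only looks at element names, so it would flag a heading
emitted as a styled paragraph, but it would miss differences in attributes or
text content.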