On Thu, 22 Sep 2011, Pablo Queixalos wrote:
Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor parser. This new implementation produces better XHTML but uses the org.apache.poi.hslf POI model.
If you wouldn't mind, please create a new JIRA entry for this, and upload your patch.
- What is the philosophy of Tika parser implementations regarding their dependencies? I mean, must the HSLFExtractor implement the strict minimum of code needed to integrate the top-level POI API, or is it OK to do it the way I did?
It's fine to use other parts of the API as needed. If you look at some of the other office parsers, you'll see that they all do that too.
- Are there conventions for the XHTML produced by the parsers: global formatting (i.e. a <div> per page, <h1> for headers) and related CSS classes?
We try to keep the XHTML simple and clean, with sensible tags, and we try to keep it similar between different formats of the same type. Ideally, if you take the same presentation and save it as .ppt, .pptx and .odp, then Tika will give you quite similar XHTML back for each.
(We don't try to re-create the exact layout and formatting of the original document, however.)
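As a rough illustration of what "simple and clean" could look like for a presentation, here is a minimal sketch that emits one <div> per slide and one <p> per text run. It uses only the Java standard library; the class name SlideXhtmlSketch, the render method and the "slide" CSS class are hypothetical choices for this example, not names taken from Tika or POI. A real parser would write to the SAX ContentHandler it is given rather than build a string.

```java
import java.util.List;

public class SlideXhtmlSketch {

    // Escape the characters that are significant in XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Render each slide as one <div> and each text run in it as one <p>.
    // The "slide" class name is an assumption made for this sketch.
    static String render(List<List<String>> slides) {
        StringBuilder xhtml = new StringBuilder();
        xhtml.append("<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>");
        for (List<String> slide : slides) {
            xhtml.append("<div class=\"slide\">");
            for (String run : slide) {
                xhtml.append("<p>").append(escape(run)).append("</p>");
            }
            xhtml.append("</div>");
        }
        xhtml.append("</body></html>");
        return xhtml.toString();
    }

    public static void main(String[] args) {
        // Two made-up slides standing in for extracted presentation text.
        List<List<String>> slides = List.of(
            List.of("Quarterly results", "Revenue & costs"),
            List.of("Next steps"));
        System.out.println(render(slides));
    }
}
```

Because the markup depends only on the slide text and not on the source format, the same structure could in principle be produced for .ppt, .pptx and .odp input.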
Nick
