Re: HSLFExtractor & POI : Looking for better XHTML

Nick Burch Thu, 22 Sep 2011 02:55:09 -0700

On Thu, 22 Sep 2011, Pablo Queixalos wrote:

Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor parser. This new implementation produces better XHTML but uses the org.apache.poi.hslf POI model.

If you wouldn't mind, please create a new JIRA entry for this, and upload your patch.

- What is the philosophy of Tika parser implementations regarding their dependencies? I mean, must the HSLFExtractor implement the strict minimum of code needed to integrate the top-level POI API, or is it OK to do it the way I did?

It's fine to use other parts of the API as needed. If you look at some of the other office parsers you'll see that they all do that too.

- Are there conventions for the XHTML produced by the parsers: global formatting (i.e. a <div> per page, <h1> for headers) and related CSS classes?

We try to keep the XHTML simple and clean, with sensible tags, and we try to keep it similar between different formats of the same type. Ideally, if you take the same presentation and save it as .ppt, .pptx and .odp, then Tika will give you quite similar XHTML back again.

(We don't try to re-create the exact layout and formatting of the original document, however.)
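
As a rough illustration of what "simple and clean" might look like for a presentation, here is a minimal sketch that emits one <div> per slide, an <h1> for the title and a <p> per paragraph as SAX events. It uses only the JDK so it can be run on its own. The Slide holder, the "slide" class name and the choice of tags are assumptions made for the example, not an agreed Tika convention, and a real parser would write to the ContentHandler it is handed (normally through Tika's XHTMLContentHandler) rather than serialising the output itself.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SlideXhtmlSketch {

    /** Hypothetical stand-in for whatever the POI model gives you per slide. */
    static final class Slide {
        final String title;
        final List<String> paragraphs;

        Slide(String title, List<String> paragraphs) {
            this.title = title;
            this.paragraphs = paragraphs;
        }
    }

    /** Emits one div per slide: h1 for the title, p for each paragraph. */
    static void emit(List<Slide> slides, ContentHandler out) throws SAXException {
        out.startDocument();
        start(out, "html", null);
        start(out, "body", null);
        for (Slide slide : slides) {
            start(out, "div", "slide"); // "slide" class name is an assumption
            textElement(out, "h1", slide.title);
            for (String paragraph : slide.paragraphs) {
                textElement(out, "p", paragraph);
            }
            out.endElement("", "div", "div");
        }
        out.endElement("", "body", "body");
        out.endElement("", "html", "html");
        out.endDocument();
    }

    private static void start(ContentHandler out, String tag, String cssClass)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        if (cssClass != null) {
            attrs.addAttribute("", "class", "class", "CDATA", cssClass);
        }
        out.startElement("", tag, tag, attrs);
    }

    private static void textElement(ContentHandler out, String tag, String text)
            throws SAXException {
        start(out, tag, null);
        out.characters(text.toCharArray(), 0, text.length());
        out.endElement("", tag, tag);
    }

    /** Serialises the SAX events to a string so the result can be inspected. */
    static String render(List<Slide> slides) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // Set output properties before attaching the result.
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            handler.setResult(new StreamResult(writer));
            emit(slides, handler);
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not render slides", e);
        }
    }

    public static void main(String[] args) {
        List<Slide> slides = List.of(
                new Slide("Q&A", List.of("First point", "Second point")));
        System.out.println(render(slides));
    }
}
```

Because emit() only depends on ContentHandler, the same structure could be produced from the .ppt, .pptx and .odp parsers, which is the kind of similarity between formats described above.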


Nick
