[ https://issues.apache.org/jira/browse/TIKA-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aeham Abushwashi updated TIKA-1768:
-----------------------------------
Affects Version/s: 1.13
> Document headers and footers in metadata
> ----------------------------------------
>
> Key: TIKA-1768
> URL: https://issues.apache.org/jira/browse/TIKA-1768
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Aeham Abushwashi
> Priority: Critical
> Attachments: HeaderAndFooterTestFiles.zip, headers_footers.patch
>
>
> I have a use case where I need document headers and footers to be explicitly
> marked as such in Tika's output metadata fields. As far as I can see, there's
> no easy built-in way to do this.
> The attached patch adds a HeaderFooterContentHandler which enables addition
> of headers and footers into their own metadata fields. This works out of the
> box with Word file formats.
> Also included in the patch are some tweaks that enable the Excel and PowerPoint
> parsers/extractors to explicitly mark headers and footers as such in the
> output XHTML, so that the aforementioned content handler can spot them. Unit
> tests have been added, and existing ones modified, to verify that the parsers
> and the content handler work together correctly.
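
For readers without the attachment to hand, the idea described above can be sketched roughly as follows. This is not the code in headers_footers.patch; it is a minimal, JDK-only illustration that assumes the parsers mark headers and footers with class="header" and class="footer" attributes in the XHTML output. The class name HeaderFooterSketch, the "header"/"footer" field names, and the plain Map standing in for Tika's Metadata object are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects text inside elements marked class="header"
// or class="footer" into separate fields. Not the handler from the patch.
class HeaderFooterSketch extends DefaultHandler {
    // Stand-in for a metadata object: field name -> collected values.
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    private String current;  // "header", "footer", or null when outside a marked region
    private int depth;       // element nesting depth inside the marked region
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++;  // nested element inside a header/footer
            return;
        }
        String cls = atts.getValue("class");
        if ("header".equals(cls) || "footer".equals(cls)) {
            current = cls;
            depth = 1;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        if (--depth == 0) {
            // The marked element has closed: store its text under its field.
            fields.computeIfAbsent(current, k -> new ArrayList<>()).add(text.toString().trim());
            current = null;
        }
    }

    // Convenience helper: parse an XHTML string and return the collected fields.
    static Map<String, List<String>> extract(String xhtml) throws Exception {
        HeaderFooterSketch handler = new HeaderFooterSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.fields;
    }
}
```

In a real Tika setup, a handler along these lines would presumably wrap the downstream handler (for example by extending org.apache.tika.sax.ContentHandlerDecorator) and write into the Metadata instance passed to the parser, rather than into a Map.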