[jira] [Commented] (TIKA-2362) Skipping Header and Footer data from documents

Nick Burch (JIRA) Tue, 16 May 2017 07:47:20 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012500#comment-16012500 ]


Nick Burch commented on TIKA-2362:
----------------------------------

On the whole, the headers and footers should be in their own div tags with 
sensible-sounding names. As long as you're working at the XHTML level, you 
should be able to filter those out with an XPath content handler. (You can then 
turn that back into plain text later if you want.)
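
To illustrate the filtering idea, here is a minimal sketch that uses a plain JDK 
SAX handler rather than Tika's XPath content handler. It drops everything inside 
div elements whose class is "header" or "footer" and collects the remaining 
text. The class name HeaderFooterFilter, the strip helper, and the two class 
values are assumptions for illustration; check the XHTML your parser actually 
emits before relying on them.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects text from XHTML, skipping header/footer divs.
class HeaderFooterFilter extends DefaultHandler {
    // Assumed class names; verify against the XHTML produced for your documents.
    private static final Set<String> SKIPPED = Set.of("header", "footer");

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside a skipped div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside a skipped div
        } else if ("div".equals(qName)) {
            String cls = atts.getValue("class");
            if (cls != null && SKIPPED.contains(cls)) {
                skipDepth = 1; // entering a header or footer div
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--; // leaving an element inside (or closing) a skipped div
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length); // keep only text outside skipped divs
        }
    }

    // Parses an XHTML string and returns the text outside header/footer divs.
    static String strip(String xhtml) {
        try {
            HeaderFooterFilter handler = new HeaderFooterFilter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

Calling HeaderFooterFilter.strip on a small XHTML string with one header div, 
one paragraph, and one footer div should return only the paragraph text. In 
Tika, a handler like this could in principle be handed to the parser directly 
instead of reading from a string.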

> Skipping Header and Footer data from documents
> ----------------------------------------------
>
>                 Key: TIKA-2362
>                 URL: https://issues.apache.org/jira/browse/TIKA-2362
>             Project: Tika
>          Issue Type: Wish
>          Components: general, handler
>            Reporter: Mujahid Ateeb Khan
>            Assignee: Tim Allison
>            Priority: Trivial
>
> Is there any method to skip header and footer data of 
> documents (pdf, docx, doc, odt)?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
