[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

Tim Allison (JIRA) Tue, 16 Feb 2016 18:38:57 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149736#comment-15149736 ]


Tim Allison commented on TIKA-1857:
-----------------------------------

This is great.  Thank you!

So, to get the best coverage for extracted content, should we do the following?

Check for fields in the AcroForm:

a) If fields exist (static XFA), use the content extracted from the AcroForm 
and ignore the XFA.
b) If no fields exist (dynamic XFA), scrape/extract the content from the XFA.
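
To make the proposal concrete, the check above could be sketched roughly as follows. This is only an illustration of the decision, not the actual PDFParser code: the class XfaStrategySketch, the enum Source, and the method chooseSource are hypothetical names, and how the AcroForm field names and the raw XFA XML are read from the PDF is left out.

```java
import java.util.List;

// Hypothetical sketch of the proposed check; not part of Tika or PDFBox.
final class XfaStrategySketch {

    // Where the form content would be taken from.
    enum Source { ACROFORM, XFA, NONE }

    // acroFormFields: names of the fields found in the AcroForm (assumed input).
    // xfaXml: raw XFA XML as a string, or null if the PDF has none (assumed input).
    static Source chooseSource(List<String> acroFormFields, String xfaXml) {
        boolean hasFields = acroFormFields != null && !acroFormFields.isEmpty();
        if (hasFields) {
            // (a) Static XFA: rely on the AcroForm and ignore the XFA.
            return Source.ACROFORM;
        }
        boolean hasXfa = xfaXml != null && !xfaXml.trim().isEmpty();
        // (b) Dynamic XFA: no AcroForm fields, so fall back to the XFA if present.
        return hasXfa ? Source.XFA : Source.NONE;
    }
}
```

Whether branch (a) loses anything is exactly the open question below, so the sketch ignores the XFA whenever AcroForm fields are present.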

In your experience, will we miss any info if we ignore the XFA for Static XFAs 
and rely solely on the AcroForm?



> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
