[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

Tim Allison (JIRA) Fri, 26 Feb 2016 09:41:56 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169395#comment-15169395
 ]


Tim Allison commented on TIKA-1857:
-----------------------------------

I implemented a first-attempt XFA scraper with StAX; this pulls the content 
from the fields that Pascal identified into the ContentHandler, and it merges 
the "values" from the data section with the fields section.
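
As a rough illustration of that approach (a sketch only, not the code attached to this issue), the StAX pass might look something like the following. The class name, the use of the {{speak}} element as a field's label, and the simplified XFA layout are all assumptions made for the example; a real scraper would write to the ContentHandler rather than return a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the TIKA-1857 patch.
// Assumes: labels live in <field name="..."> ... <speak>label</speak> inside
// <template>, and values live in leaf elements named after the field inside <data>.
class XfaScrapeSketch {

    static List<String> scrape(String xfaXml) {
        Map<String, String> labels = new LinkedHashMap<>(); // field name -> label
        Map<String, String> values = new LinkedHashMap<>(); // field name -> value
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xfaXml));
            boolean inTemplate = false;
            boolean inData = false;
            String currentField = null;    // name of the enclosing template field
            String currentDataName = null; // innermost open element in the data section
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("template")) {
                        inTemplate = true;
                    } else if (!inTemplate && !inData && name.equals("data")) {
                        inData = true;
                    } else if (inTemplate && name.equals("field")) {
                        currentField = reader.getAttributeValue(null, "name");
                    } else if (inTemplate && currentField != null && name.equals("speak")) {
                        labels.put(currentField, reader.getElementText().trim());
                    } else if (inData) {
                        currentDataName = name;
                        text.setLength(0);
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        && inData && currentDataName != null) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("template")) {
                        inTemplate = false;
                    } else if (inData && name.equals(currentDataName)) {
                        // Leaf element closed: keep its text if it has any.
                        String value = text.toString().trim();
                        if (!value.isEmpty()) {
                            values.put(name, value);
                        }
                        currentDataName = null;
                    } else if (inData && name.equals("data")) {
                        inData = false;
                    } else if (inTemplate && name.equals("field")) {
                        currentField = null;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XFA XML", e);
        }
        // Merge: one line per template field, with its data value when present,
        // followed by any data values that had no matching template field.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : labels.entrySet()) {
            String value = values.remove(e.getKey());
            lines.add(value == null ? e.getValue() : e.getValue() + ": " + value);
        }
        for (Map.Entry<String, String> e : values.entrySet()) {
            lines.add(e.getKey() + ": " + e.getValue());
        }
        return lines;
    }
}
```

Keeping the two maps separate until the end means the template and data sections can appear in either order in the XFA stream.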

Currently, if XFA exists, I process that and skip the AcroForm data.  

I'm not certain what the best path is for ignoring/processing content extracted 
from the "regular" PDF if there is XFA data.

For now, I'm also processing the contents of the rest of the PDF. I'm more 
averse to losing data than to duplication because my main use case is 
search...but I realize this will be really frustrating to users who want "just 
one copy" of the content.

In looking at the PDFs with XFA data in govdocs1, it looks like there would be 
lost content in _some_ files if we processed only the XFA and did not do the 
regular text extraction.  On the other hand, for most of the files I examined, 
it looked like the content is entirely duplicative -- [~pascal.essiembre]'s 
point above.

I propose adding a parameter to the PDFParserConfig along the lines of 
{{ifXFAExistsProcessItAlone}}...this would allow the behavior of Pascal's 
patch.  I propose that the default be set to "false", erring on the side of 
extracting more content at the cost of duplication.
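
A minimal sketch of how such a flag could gate the behavior, assuming a plain boolean with a getter and setter. The class below is a hypothetical stand-in, not the real PDFParserConfig, and the order in which XFA and regular text are emitted is also an assumption.

```java
// Hypothetical stand-in for the proposed option; not the actual PDFParserConfig.
class PdfXfaConfigSketch {

    // Proposed default: false, i.e. favor extracting more content over avoiding duplicates.
    private boolean ifXFAExistsProcessItAlone = false;

    boolean getIfXFAExistsProcessItAlone() {
        return ifXFAExistsProcessItAlone;
    }

    void setIfXFAExistsProcessItAlone(boolean ifXFAExistsProcessItAlone) {
        this.ifXFAExistsProcessItAlone = ifXFAExistsProcessItAlone;
    }

    // Combines already-extracted text according to the flag.
    // xfaText is null or empty when the PDF has no XFA.
    static String combine(PdfXfaConfigSketch config, String xfaText, String regularText) {
        if (xfaText == null || xfaText.isEmpty()) {
            return regularText;            // no XFA: regular extraction only
        }
        if (config.getIfXFAExistsProcessItAlone()) {
            return xfaText;                // XFA only, as in Pascal's patch
        }
        return xfaText + "\n" + regularText; // both, accepting possible duplication
    }
}
```

With the default, a PDF that has XFA yields both the XFA text and the regular text; setting the flag to true yields only the XFA text.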

Is this ok?  Or, is there an easy way to determine if regular content is 
entirely duplicative of XFA content?



> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
