[
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169395#comment-15169395
]
Tim Allison commented on TIKA-1857:
-----------------------------------
I implemented a first-attempt XFA scraper with StAX; this pulls the content
from the fields that Pascal identified into the ContentHandler, and it merges
the "values" from the data section with the fields section.
Currently, if XFA exists, I process that and skip the AcroForm data.
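For illustration only, the StAX approach described above can be sketched roughly as follows. This is not the actual patch: the class name {{XfaScrapeSketch}}, the method {{scrape}}, and the simplified, namespace-free element layout ({{field}} elements carrying a {{name}} attribute, and a {{data}} element whose leaf children are named after the fields) are all assumptions made for the example. Real XFA is namespaced and considerably richer.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not Tika code: collects field names from the
// template section and merges in values found in the data section.
public class XfaScrapeSketch {

    /** Returns field name -> value pairs in template order ("" if no value). */
    public static Map<String, String> scrape(String xml) {
        Map<String, String> fields = new LinkedHashMap<>(); // from <field name="...">
        Map<String, String> data = new LinkedHashMap<>();   // leaf text under <data>
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Defensive settings for untrusted input.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            boolean inData = false;
            String current = null;                 // innermost open element inside <data>
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("data".equals(name)) {
                        inData = true;
                    } else if ("field".equals(name)) {
                        String fieldName = reader.getAttributeValue(null, "name");
                        if (fieldName != null) {
                            fields.put(fieldName, "");
                        }
                    } else if (inData) {
                        current = name;            // a child start resets the candidate leaf
                        text.setLength(0);
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && inData && current != null) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("data".equals(name)) {
                        inData = false;
                    } else if (inData && name.equals(current)) {
                        data.put(name, text.toString().trim()); // only leaves reach here
                        current = null;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XFA XML", e);
        }
        // Merge step: attach each data value to the template field of the same name.
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String value = data.get(entry.getKey());
            if (value != null) {
                entry.setValue(value);
            }
        }
        return fields;
    }
}
```

Calling {{XfaScrapeSketch.scrape(xml)}} on a document with fields {{Name}} and {{City}} and matching data elements would yield a map of those names to their values; the merge happens after the stream is fully read, so the relative order of the two sections does not matter.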
I'm not certain what the best path is for ignoring/processing content extracted
from the "regular" PDF if there is XFA data.
For now, I'm also processing the contents of the rest of the PDF. I'm more
averse to losing data than to duplication because my main use case is
search...but I realize this will be really frustrating to users who want "just
one copy" of the content.
In looking at the PDFs with XFA data in govdocs1, it looks like there would be
lost content in _some_ files if we processed only the XFA and did not do the
regular text extraction. On the other hand, for most of the files I examined,
it looked like the content is entirely duplicative -- [~pascal.essiembre]'s
point above.
I propose adding a parameter to the PDFParserConfig along the lines of
{{ifXFAExistsProcessItAlone}}...this would allow the behavior of Pascal's
patch. I propose that the default be set to "false", erring on the side of
extracting more content at the cost of duplication.
Is this ok? Or, is there an easy way to determine if regular content is
entirely duplicative of XFA content?
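To make the proposal concrete, here is a minimal sketch of how such a flag might gate processing. Everything in it is hypothetical: {{XfaConfigSketch}} is a stand-in class, not the real PDFParserConfig, and {{sectionsToProcess}} is an invented helper. The no-XFA branch (AcroForm plus page text) is an assumption; only the XFA branches follow the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed option; not the real PDFParserConfig.
public class XfaConfigSketch {

    // Proposed default: false, i.e. prefer extra content over de-duplication.
    private boolean ifXFAExistsProcessItAlone = false;

    public boolean getIfXFAExistsProcessItAlone() {
        return ifXFAExistsProcessItAlone;
    }

    public void setIfXFAExistsProcessItAlone(boolean value) {
        this.ifXFAExistsProcessItAlone = value;
    }

    /** Lists which parts of a PDF would be processed under the current setting. */
    public List<String> sectionsToProcess(boolean hasXfa) {
        List<String> sections = new ArrayList<>();
        if (hasXfa) {
            sections.add("XFA");               // XFA replaces the AcroForm data
            if (!ifXFAExistsProcessItAlone) {
                sections.add("page text");     // may duplicate the XFA content
            }
        } else {
            sections.add("AcroForm");          // assumed behavior when no XFA exists
            sections.add("page text");
        }
        return sections;
    }
}
```

With the default setting, a PDF containing XFA would yield both "XFA" and "page text"; after {{setIfXFAExistsProcessItAlone(true)}}, only "XFA" would be processed.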
> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip,
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA). Information about XFA:
> https://en.wikipedia.org/wiki/XFA