[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830110#comment-17830110 ]
Tilman Hausherr edited comment on TIKA-4171 at 3/23/24 5:50 PM:
----------------------------------------------------------------
We have a regression with the file [^876503.pdf] in the XFAExtractor class.
What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is
empty. Because of that, the text "Enter the full name of the conveying party or
parties" is missing for the field "conname1".
I'm not saying that this is wrong; I just wonder whether it is intended.
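
To make the question concrete, here is a minimal sketch of the two possible behaviours. It is not taken from the XFAExtractor source: the class, the method names and the "label: values" output format are all assumptions made for illustration. Only the names {{displayFieldName}} and {{fieldValues}} come from the comment above.

```java
import java.util.List;

public class FieldLabelDemo {

    // Hypothetical formatter, not Tika code.
    // Drops the label entirely when the field has no values,
    // which matches the behaviour described in the comment.
    static String labelOnlyWithValues(String displayFieldName, List<String> fieldValues) {
        if (fieldValues.isEmpty()) {
            return "";
        }
        return displayFieldName + ": " + String.join(", ", fieldValues);
    }

    // Hypothetical alternative: always emits the label,
    // so the prompt text survives for empty fields.
    static String labelAlways(String displayFieldName, List<String> fieldValues) {
        return displayFieldName + ": " + String.join(", ", fieldValues);
    }

    public static void main(String[] args) {
        String label = "Enter the full name of the conveying party or parties";
        System.out.println("[" + labelOnlyWithValues(label, List.of()) + "]");
        System.out.println("[" + labelAlways(label, List.of()) + "]");
    }
}
```

With an empty value list, the first variant prints nothing between the brackets, while the second still prints the label text. Which of the two is wanted for "conname1" is exactly the open question.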
> Tika server only returns last value for PDFs that have multiple of the same
> key
> -------------------------------------------------------------------------------
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
> Issue Type: Bug
> Components: tika-server
> Reporter: Cassandra Xia
> Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png,
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle
> Adobe's protected form format that FERC uses.
> One problem that I'm hitting is that the FERC form that I am parsing has
> multiple values for the same key name, e.g. in the screenshot below lines 1-7
> all have the same key name. When Tika Server parses this PDF, it only returns
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in
> some dictionary object, so the final value is the only survivor. Would it be
> possible to return the extra values as a list from Tika Server?
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!
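
The reporter's hunch about a dictionary can be illustrated with a small sketch. It is not Tika code and makes no claim about where the values are actually lost; the class, the method names and the sample key are hypothetical. It only shows why a single-valued map keeps the last of several values for one key, and how a map of lists keeps them all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepeatedKeyDemo {

    // Single-valued map: each put() on an existing key overwrites
    // the earlier value, so only the last one survives.
    static Map<String, String> lastValueWins(List<Map.Entry<String, String>> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> f : fields) {
            out.put(f.getKey(), f.getValue());
        }
        return out;
    }

    // Multi-valued map: values for a repeated key accumulate in input order.
    static Map<String, List<String>> keepAll(List<Map.Entry<String, String>> fields) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> f : fields) {
            out.computeIfAbsent(f.getKey(), k -> new ArrayList<>()).add(f.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical rows sharing one key name, as in the reported form.
        List<Map.Entry<String, String>> rows = List.of(
                Map.entry("owner", "row 1"),
                Map.entry("owner", "row 2"),
                Map.entry("owner", "row 7"));
        System.out.println(lastValueWins(rows)); // {owner=row 7}
        System.out.println(keepAll(rows));       // {owner=[row 1, row 2, row 7]}
    }
}
```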
--
This message was sent by Atlassian Jira
(v8.20.10#820010)