[ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr reopened TIKA-4171:
-----------------------------------
> Tika server only returns last value for PDFs that have multiple of the same
> key
> -------------------------------------------------------------------------------
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
> Issue Type: Bug
> Components: tika-server
> Reporter: Cassandra Xia
> Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle
> Adobe's protected form format that FERC uses.
> One problem that I'm hitting is that the FERC form that I am parsing has
> multiple values for the same key name, e.g. in the screenshot below lines 1-7
> all have the same key name. When Tika Server parses this PDF, it only returns
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in
> some dictionary object, so the final value is the only survivor. Would it be
> possible to return the extra values as a list from Tika Server?
> Example PDF attached - thank you for taking a look!
> [embedded screenshot]
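
To make the reporter's hunch concrete, here is a minimal sketch in plain Java of how a single-valued map drops repeated keys while a multi-valued map keeps them. It is not Tika's actual code: the class name DuplicateKeyDemo, both method names, and the field name "facilityName" are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateKeyDemo {

    // Single-valued map: each put() on an existing key overwrites the previous value.
    static Map<String, String> lastValueWins(List<String[]> fields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] field : fields) {
            result.put(field[0], field[1]);
        }
        return result;
    }

    // Multi-valued map: values for a repeated key are appended in document order.
    static Map<String, List<String>> keepAll(List<String[]> fields) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] field : fields) {
            result.computeIfAbsent(field[0], k -> new ArrayList<>()).add(field[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical field name and values, standing in for rows 1-7 of the form.
        List<String[]> fields = new ArrayList<>();
        for (int row = 1; row <= 7; row++) {
            fields.add(new String[] {"facilityName", "value from row " + row});
        }
        // Only the row 7 value survives in the single-valued map.
        System.out.println(lastValueWins(fields));
        // All seven values are retained in the multi-valued map.
        System.out.println(keepAll(fields).get("facilityName"));
    }
}
```

If the cause is as guessed, returning the values as a list would amount to switching from the first pattern to the second wherever the form fields are collected.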
--
This message was sent by Atlassian Jira
(v8.20.10#820010)