Cassandra Xia created TIKA-4171:
-----------------------------------

             Summary: Tika server only returns last value for PDFs that have 
multiple of the same key
                 Key: TIKA-4171
                 URL: https://issues.apache.org/jira/browse/TIKA-4171
             Project: Tika
          Issue Type: Bug
          Components: tika-server
            Reporter: Cassandra Xia
         Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
FINAL.pdf

Thanks for the great work on Tika Server. It is the only open-source tool I
have found that can handle Adobe's protected form format that FERC uses.

One problem I'm hitting is that the FERC form I am parsing has multiple values
for the same key name, e.g. in the screenshot below, rows 1-7 all have the same
key name. When Tika Server parses this PDF, it returns only the value in row 7
(losing the previous six values).

My hunch is that somewhere in Tika Server the values are stored in a
dictionary-like object keyed by name, so each new value overwrites the previous
one and only the final value survives. Would it be possible to return all of
the values for a repeated key as a list from Tika Server?
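
To illustrate the hunch above, here is a minimal sketch in plain Java (no Tika
classes involved). It only shows how a single-valued map drops repeated keys
while a multi-valued map keeps them; it is not taken from Tika's source, and
whether Tika Server actually behaves this way is unconfirmed. The class name,
method names, and the key "rowName" are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateKeyDemo {

    // Each field is a {key, value} pair. Storing into a single-valued map
    // means a later put() with the same key replaces the earlier value.
    static Map<String, String> collectLastWins(String[][] fields) {
        Map<String, String> out = new HashMap<>();
        for (String[] field : fields) {
            out.put(field[0], field[1]);
        }
        return out;
    }

    // Storing into a map of lists keeps every value, in encounter order.
    static Map<String, List<String>> collectAll(String[][] fields) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String[] field : fields) {
            out.computeIfAbsent(field[0], k -> new ArrayList<>()).add(field[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Seven rows sharing one hypothetical key name, as in the report.
        String[][] fields = new String[7][];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = new String[] {"rowName", "value " + (i + 1)};
        }
        System.out.println(collectLastWins(fields)); // only the row 7 value remains
        System.out.println(collectAll(fields));      // all seven values are kept
    }
}
```

If the cause is along these lines, the requested behaviour corresponds to the
second method: one key mapped to a list of values.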

Example PDF attached - thank you for taking a look!

[Screenshot: form rows 1-7, all sharing the same key name]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
