[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694782#comment-13694782
 ] 

Nick Burch commented on TIKA-973:
---------------------------------

To make reviewing easier, it would be handy if you could upload a PNG 
screenshot of one of these forms, so it's quick to view alongside the 
suggested HTML.

I'd be minded to go for something like:
  <li title="Street Number and Name">StreetNumberName[0]: 123 Main St</li>

So we'd have the alt name, the partial name and the value, but not the full 
name (though we would have the form/subform name elsewhere).
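
As a rough illustration of that output shape, here is a minimal sketch in 
plain Java. The class and method names (FormFieldHtml, toListItem, escape) 
are hypothetical and not part of Tika or PDFBox; it assumes the alt name, 
partial name and value have already been read from the PDF field as 
strings, and it only shows how the three pieces could be combined into the 
<li> element with basic HTML escaping.

```java
// Hypothetical helper for illustration only; not Tika or PDFBox API.
final class FormFieldHtml {
    private FormFieldHtml() {}

    /** Escapes characters that are unsafe in HTML text and attribute values. */
    static String escape(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Builds one list item: the alt name goes into the title attribute
     * (omitted when absent), followed by "partialName: value" as the text.
     * The fully qualified field name is deliberately left out.
     */
    static String toListItem(String altName, String partialName, String value) {
        StringBuilder sb = new StringBuilder("<li");
        if (altName != null && !altName.isEmpty()) {
            sb.append(" title=\"").append(escape(altName)).append('"');
        }
        sb.append('>');
        sb.append(escape(partialName)).append(": ").append(escape(value));
        return sb.append("</li>").toString();
    }
}
```

Calling FormFieldHtml.toListItem("Street Number and Name", 
"StreetNumberName[0]", "123 Main St") would produce the line shown above. 
Dropping the title attribute when there is no alt name is a choice made for 
this sketch, not something settled in the discussion.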
                
> PDF form data isn't included in extracted content.
> --------------------------------------------------
>
>                 Key: TIKA-973
>                 URL: https://issues.apache.org/jira/browse/TIKA-973
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Michael Graessle
>            Priority: Minor
>         Attachments: TIKA-973-patch.tar.gz
>
>
> When extracting content from PDFs, PDF form data isn't extracted. 
> The following code extracts this data via PDFBox, but it seems like 
> something Tika should be doing.
> PDDocumentCatalog docCatalog = load.getDocumentCatalog();
> if (docCatalog != null) {
>   PDAcroForm acroForm = docCatalog.getAcroForm();
>   if (acroForm != null) {
>       @SuppressWarnings("unchecked")
>       List<PDField> fields = acroForm.getFields();
>       if (fields != null && fields.size() > 0) {
>         documentContent.append(" ");
>         for (PDField field : fields) {
>               if (field.getValue()!=null) {
>                 documentContent.append(field.getValue());
>                 documentContent.append(" ");
>               }
>         }
>       }
>   }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
