[jira] [Created] (TIKA-3026) Consider extracting structure/tags where possible in PDFs with the PDFMarkedContentExtractor

Tim Allison (Jira) Thu, 16 Jan 2020 11:43:28 -0800

Tim Allison created TIKA-3026:
---------------------------------

             Summary: Consider extracting structure/tags where possible in PDFs 
with the PDFMarkedContentExtractor
                 Key: TIKA-3026
                 URL: https://issues.apache.org/jira/browse/TIKA-3026
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison



Some PDFs contain tags that _may_ be useful in understanding the structure of 
the elements within a PDF, e.g. table markup, paragraph breaks, headers, etc.  

The quality of the tags depends entirely on the software and the person 
generating the PDF.  There are no guarantees.  Nevertheless, it might be useful 
in some cases for users to be able to extract content with structure tags.
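
To make "content with structure tags" a bit more concrete, here is a minimal, 
self-contained sketch of one possible output shape.  It is an assumption for 
discussion, not Tika or PDFBox code: StructureSketch, Node and toXhtml are 
hypothetical names, and Node merely stands in for whatever structure element 
the PDF library exposes.  The sketch maps a few standard PDF structure types 
(P, H1-H6, Table, TR, TH, TD, L, LI) to XHTML elements and passes the content 
of any other type through without a wrapper, since tags are not guaranteed to 
be meaningful.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StructureSketch {

    /** Hypothetical stand-in for one element of a tagged PDF's structure tree. */
    public static final class Node {
        final String type;      // PDF structure type, e.g. "P", "H1", "Table"
        final String text;      // text of the marked content, or null for containers
        final List<Node> kids;  // child structure elements, in document order

        public Node(String type, String text, Node... kids) {
            this.type = type;
            this.text = text;
            this.kids = Arrays.asList(kids);
        }
    }

    // Assumed mapping from standard structure types to XHTML element names.
    private static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("P", "p");
        for (int i = 1; i <= 6; i++) {
            TAGS.put("H" + i, "h" + i);
        }
        TAGS.put("Table", "table");
        TAGS.put("TR", "tr");
        TAGS.put("TH", "th");
        TAGS.put("TD", "td");
        TAGS.put("L", "ul");   // simplification: every list is treated as unordered
        TAGS.put("LI", "li");
    }

    /** Serializes the subtree rooted at n as an XHTML fragment. */
    public static String toXhtml(Node n) {
        StringBuilder sb = new StringBuilder();
        append(n, sb);
        return sb.toString();
    }

    private static void append(Node n, StringBuilder sb) {
        String tag = TAGS.get(n.type);  // null for unmapped or custom types
        if (tag != null) {
            sb.append('<').append(tag).append('>');
        }
        if (n.text != null) {
            sb.append(escape(n.text));
        }
        for (Node kid : n.kids) {
            append(kid, sb);
        }
        if (tag != null) {
            sb.append("</").append(tag).append('>');
        }
    }

    // Escapes the characters that would otherwise break the markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Node doc = new Node("Document", null,
                new Node("H1", "Quarterly report"),
                new Node("P", "Revenue < costs"),
                new Node("Table", null,
                        new Node("TR", null,
                                new Node("TH", "Year"),
                                new Node("TD", "2020"))));
        System.out.println(toXhtml(doc));
    }
}
```

Dropping the wrapper for unmapped types, rather than inventing an element for 
them, keeps the text intact even when a PDF uses custom or sloppy tags.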

Some references:

[https://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care/]

[https://www.adobe.com/accessibility/products/acrobat/pdf-repair-add-tags.html]

[https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
