Tim Allison created TIKA-3026:
---------------------------------
Summary: Consider extracting structure/tags where possible in PDFs
with the PDFMarkedContentExtractor
Key: TIKA-3026
URL: https://issues.apache.org/jira/browse/TIKA-3026
Project: Tika
Issue Type: Task
Reporter: Tim Allison
Some PDFs contain tags that _may_ be useful in understanding the structure of
the elements within the PDF, e.g. table markup, paragraph breaks, headers, etc.
The quality of the tags depends entirely on the software and the person
generating the PDF; there are no guarantees. Nevertheless, in some cases it
might be useful for users to be able to extract content together with its
structure tags.
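As a rough illustration of what "content with structure tags" could look like on the output side, here is a minimal sketch. It does not call PDFBox or Tika at all: {{Node}} is a hypothetical stand-in for whatever tagged elements the PDFMarkedContentExtractor would yield, and the tag-to-XHTML mapping is an assumption made for illustration, not a proposed design. Unrecognized tags fall back to a generic {{div}}, since the tags in a given PDF can't be trusted to be standard or sensible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StructureTagSketch {

    // Assumed mapping from a few PDF standard structure types to XHTML names.
    // Illustrative only; a real mapping would need to cover far more types.
    static final Map<String, String> TAG_TO_XHTML = new HashMap<>();
    static {
        TAG_TO_XHTML.put("P", "p");
        TAG_TO_XHTML.put("H1", "h1");
        TAG_TO_XHTML.put("H2", "h2");
        TAG_TO_XHTML.put("Table", "table");
        TAG_TO_XHTML.put("TR", "tr");
        TAG_TO_XHTML.put("TH", "th");
        TAG_TO_XHTML.put("TD", "td");
    }

    // Hypothetical stand-in for one tagged element: a tag, its own text,
    // and any nested tagged elements.
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children;

        Node(String tag, String text, Node... children) {
            this.tag = tag;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Unknown or nonstandard tags degrade to a generic container.
    static String toXhtmlName(String pdfTag) {
        String name = TAG_TO_XHTML.get(pdfTag);
        return name != null ? name : "div";
    }

    // Escape the characters that are significant in XHTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Render a node and its children depth-first as nested XHTML elements.
    static String render(Node node) {
        String name = toXhtmlName(node.tag);
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(name).append('>');
        sb.append(escape(node.text));
        for (Node child : node.children) {
            sb.append(render(child));
        }
        sb.append("</").append(name).append('>');
        return sb.toString();
    }

    public static void main(String[] args) {
        Node table = new Node("Table", "",
                new Node("TR", "",
                        new Node("TH", "Name"),
                        new Node("TD", "Tika")));
        System.out.println(render(new Node("H1", "Title")));
        System.out.println(render(table));
    }
}
```

In Tika itself the output would presumably go through the existing XHTML SAX events rather than string concatenation; the string form is used here only to keep the sketch self-contained.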
Some references:
[https://acrobatusers.com/tutorials/what-are-pdf-tags-and-why-should-i-care/]
[https://www.adobe.com/accessibility/products/acrobat/pdf-repair-add-tags.html]
[https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/]