[
https://issues.apache.org/jira/browse/PDFBOX-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646033#comment-16646033
]
Tilman Hausherr commented on PDFBOX-4337:
-----------------------------------------
There is no such feature in PDFBox, and we are not planning to add one. PDF
is very complex (unlike HTML), and there are many ways to produce the same
visual result. For example, what you think is an "image" might be a vector
graphic, or a puzzle of 1000 tiny images. A "table" is usually a vector
graphic, i.e. there is no table concept in PDF. Two texts that look 99%
identical can be made of 100% different character codes due to font
subsetting. To understand what I mean, open a PDF file in PDFDebugger and
look around. Also read the [PDF
specification|https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf].
There are a few "PDF to XML" projects on GitHub, but I doubt they will really
help you; they only move the problem elsewhere. The only thing that may help
is the TestPDFToImage.java test in the source code, which shows how to compare
two PDF renderings.
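As a rough illustration of the rendering-comparison idea (this is not the code
from TestPDFToImage.java), the sketch below counts the pixels that differ
between two already rendered pages. The class name RenderingDiff and the method
countDifferingPixels are hypothetical helpers, not PDFBox API, and the
convention of returning -1 for a size mismatch is an assumption made for this
example.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of PDFBox): pixel-level diff of two rendered pages. */
final class RenderingDiff {

    private RenderingDiff() {
    }

    /**
     * Counts the pixels whose ARGB values differ between the two images.
     * Returns -1 when the dimensions differ (an assumed convention for this sketch).
     */
    static long countDifferingPixels(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return -1;
        }
        long differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                // getRGB returns the pixel in the default ARGB color model
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }
}
```

With PDFBox 2.x, the two images could be obtained from
PDFRenderer.renderImageWithDPI(pageIndex, dpi) for the same page of each
document. A non-zero result only says that the pages look different; it does
not say which text, image, or table was inserted, deleted, or modified.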
> Could extract all elements(Text, Image, Table, etc) dynamically in sequence
> from pdf file
> ------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4337
> URL: https://issues.apache.org/jira/browse/PDFBOX-4337
> Project: PDFBox
> Issue Type: Wish
> Reporter: RuhongCai
> Priority: Major
> Attachments: sample_pdf.pdf
>
>
> We are trying to compare two PDF files at run time and detect the
> "insertions", "deletions", and "modifications" between them.
> PDFBox works well for extracting the text of the two files, but that is not
> enough for us.
> Is there any API in PDFBox, or any workaround, to read/extract all
> components (table, image, text, etc.) from a PDF file in sequence and return
> some related useful information?
> The attached sample file contains text, a table, and an image, and is not
> well formatted. Reading the elements/components in sequence
> would allow further comparison work.
> [^sample_pdf.pdf]
>
> Many thanks!
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)