[jira] [Updated] (PDFBOX-4337) Could extract all elements(Text, Image, Table, etc) dynamically in sequence from pdf file

RuhongCai (JIRA) Wed, 10 Oct 2018 16:07:42 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


RuhongCai updated PDFBOX-4337:
------------------------------
    Description: 
We are trying to compare two PDF files at run time and detect insertions, 
deletions, and modifications between them.

PDFBox works well for extracting the text of both files, but that is not 
enough for us.

Is there any API in PDFBox, or any workaround, to read/extract all 
components (table, image, text, etc.) from a PDF file in sequence and return 
some useful related information?

The attached sample file contains text, a table, an image, and content that 
is not well formatted. If we could read the elements/components in sequence, 
we could do the further comparison work ourselves. 

[^sample_pdf.pdf]
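
To make the intended comparison concrete, here is a minimal sketch of the 
diff step only, assuming the elements of each file have already been 
extracted in reading order. It does not use any PDFBox API. The class name 
ElementDiff, the "KIND:content" string encoding of an element, and the 
operation labels are all hypothetical and chosen for illustration. The sketch 
aligns the two sequences with a longest-common-subsequence table and reports 
a MODIFY when two differing elements of the same kind occupy the same slot.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: compares two ordered lists of already-extracted elements.
// Each element is assumed to be encoded as "KIND:content", e.g. "TEXT:Hello".
final class ElementDiff {

    // Returns the kind prefix of an element string, e.g. "TEXT".
    static String kind(String element) {
        int colon = element.indexOf(':');
        return colon < 0 ? element : element.substring(0, colon);
    }

    // Returns the list of MODIFY / DELETE / INSERT operations that turn
    // oldElems into newElems; unchanged elements are not reported.
    static List<String> diff(List<String> oldElems, List<String> newElems) {
        int n = oldElems.size();
        int m = newElems.size();

        // lcs[i][j] = length of the longest common subsequence of
        // oldElems[i..] and newElems[j..].
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = oldElems.get(i).equals(newElems.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        List<String> ops = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n || j < m) {
            boolean both = i < n && j < m;
            if (both && oldElems.get(i).equals(newElems.get(j))) {
                // Unchanged element: advance in both sequences.
                i++;
                j++;
            } else if (both
                    && kind(oldElems.get(i)).equals(kind(newElems.get(j)))
                    && lcs[i + 1][j + 1] == lcs[i][j]) {
                // Same kind, different content, and pairing them does not
                // shorten the best alignment: report a modification.
                ops.add("MODIFY " + oldElems.get(i) + " -> " + newElems.get(j));
                i++;
                j++;
            } else if (j == m || (i < n && lcs[i + 1][j] >= lcs[i][j + 1])) {
                ops.add("DELETE " + oldElems.get(i));
                i++;
            } else {
                ops.add("INSERT " + newElems.get(j));
                j++;
            }
        }
        return ops;
    }
}
```

For example, comparing [TEXT:Hello, IMAGE:abc, TABLE:t1] with 
[TEXT:Hello world, IMAGE:abc, TEXT:extra] would report one MODIFY for the 
first text element, one DELETE for the table, and one INSERT for the new 
text. How each element is actually extracted and fingerprinted is exactly 
the part this issue asks about and is left open here.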

 

Many thanks!

  was:
We are trying to compare two PDF files at run time and detect insertions, 
deletions, and modifications between them.

PDFBox works well for extracting the text of both files, but that is not 
enough for us.

Is there any API in PDFBox, or any workaround, to read/extract all 
components (table, image, text, etc.) from a PDF file in sequence and return 
some useful related information?

The attached sample file contains, as you can see, text, a table, an image, 
and content that is not well formatted.  

[^sample_pdf.pdf]

 

Many thanks!


> Could extract all elements(Text, Image, Table, etc) dynamically in sequence 
> from pdf file 
> ------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4337
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4337
>             Project: PDFBox
>          Issue Type: Wish
>            Reporter: RuhongCai
>            Priority: Major
>         Attachments: sample_pdf.pdf
>
>
> We are trying to compare two PDF files at run time and detect insertions, 
> deletions, and modifications between them.
> PDFBox works well for extracting the text of both files, but that is not 
> enough for us.
> Is there any API in PDFBox, or any workaround, to read/extract all 
> components (table, image, text, etc.) from a PDF file in sequence and return 
> some useful related information?
> The attached sample file contains text, a table, an image, and content that 
> is not well formatted. If we could read the elements/components in sequence, 
> we could do the further comparison work ourselves. 
> [^sample_pdf.pdf]
>  
> Many thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

