[jira] [Commented] (TIKA-2702) Different behavior between TIKA and pdfbox

Lior (JIRA) Wed, 01 Aug 2018 06:12:14 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565288#comment-16565288 ]


Lior commented on TIKA-2702:
----------------------------

I thought that Tika uses PDFBox, so I expected to get the same result.

I have a process that tokenizes the text from a PDF and then, using the PDFBox 
PDFTextStripper, looks up the font size for each token.

When PDFBox extracts the text, the second part is easy to do; with Tika, 
this scenario is complicated.

> Different behavior between TIKA and pdfbox
> ------------------------------------------
>
>                 Key: TIKA-2702
>                 URL: https://issues.apache.org/jira/browse/TIKA-2702
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Lior
>            Priority: Major
>
> As far as I understand, Tika uses PDFBox to extract text from PDF files.
> In a side benchmark I'm running, the text I get from PDFBox 2.0.9 and the 
> text I get from Tika are not 100% the same. In most cases, when there is a 
> hyperlink inside the PDF file, PDFBox ignores the link itself, while Tika 
> extracts it as text, for example:
> https://www.linkedin.com/in/jhonDo
> mailto:[jho...@yahoo.com |mailto:jho...@yahoo.com]
>  
> This is really a deal breaker for me: I'm using PDFBox for another 
> process and I need the text to be the same, so I can't use Tika at 
> the moment.
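
One way to pin down which tokens differ between the two extractions is
sketched below. It assumes both texts are already available as plain strings
and uses only the Java standard library; the class name ExtractionDiff, its
methods, and the sample sentence are hypothetical and not part of Tika or
PDFBox. Only the LinkedIn URL is taken from the report above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Tika or PDFBox API.
public class ExtractionDiff {

    // Split on runs of whitespace; a blank input yields no tokens.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Tokens that occur more often in 'candidate' than in 'reference',
    // reported in the order they appear in 'candidate'.
    static List<String> extraTokens(String reference, String candidate) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : tokenize(reference)) {
            remaining.merge(token, 1, Integer::sum);
        }
        List<String> extras = new ArrayList<>();
        for (String token : tokenize(candidate)) {
            int count = remaining.getOrDefault(token, 0);
            if (count > 0) {
                remaining.put(token, count - 1); // matched one occurrence
            } else {
                extras.add(token);               // present only in candidate
            }
        }
        return extras;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the two extraction results.
        String fromPdfBox = "John Doe LinkedIn profile";
        String fromTika = "John Doe LinkedIn profile https://www.linkedin.com/in/jhonDo";
        System.out.println(extraTokens(fromPdfBox, fromTika));
    }
}
```

Calling extraTokens with the PDFBox text as the reference and the Tika text
as the candidate lists what Tika added; swapping the arguments lists what it
dropped. The comparison is a multiset difference, so it ignores reordering
and would not flag tokens that merely moved.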



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
