[ 
https://issues.apache.org/jira/browse/PDFBOX-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355950#comment-15355950
 ] 

Christopher Clark commented on PDFBOX-3405:
-------------------------------------------

That's a very hacky solution. It looks like I would have to copy and paste a 
lot of code from both PDFTextStripper and PDFTextStreamEngine, but I guess I 
could settle for it, so this issue could be closed as far as my use case goes.

Having said that, I would certainly make the case that font size is such an 
elementary and important part of text processing that one should not need to 
copy and paste a bunch of code and modify the internals of PDFTextStreamEngine 
just to get it. If you search for "getFontSize" and "org.apache.pdfbox" on 
github[https://github.com/search?l=java&q=getFontSize+org.apache.pdfbox+-language%3AJava&ref=searchresults&type=Code&utf8=%E2%9C%93]
 you can find many cases where users are using this method. This shows that 1) 
there is demand for font size information and 2) since there is no easy way to 
get it, many users are currently using getFontSize(), which means their code 
will break dramatically on PDFs that scale text through transformation matrices. 
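
To make the scaling problem concrete, here is a minimal sketch of the 
arithmetic only. It does not use PDFBox at all: the class DisplayFontSize, its 
methods, and the sample numbers are all made up for illustration, and matrices 
are plain arrays in the PDF order [a b c d e f]. It assumes the display size is 
the raw font size times the vertical scale of (text matrix x CTM).

```java
// Hypothetical helper, not part of PDFBox. Matrices use PDF order [a b c d e f].
final class DisplayFontSize {

    // Multiplies two affine matrices (row-vector convention, as in the PDF spec).
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    // Raw font size scaled by the vertical scale of (text matrix x CTM).
    // The vertical scale is the length of the transformed unit vector (0, 1),
    // so the result stays correct when the page content is rotated.
    static double displayFontSize(double rawFontSize, double[] textMatrix, double[] ctm) {
        double[] combined = multiply(textMatrix, ctm);
        return rawFontSize * Math.hypot(combined[2], combined[3]);
    }

    public static void main(String[] args) {
        // Made-up values: the text matrix scales by 512, the CTM scales back by 1/32.
        double[] textMatrix = {512, 0, 0, 512, 0, 0};
        double[] ctm = {0.03125, 0, 0, 0.03125, 0, 0};
        double raw = 1.0;

        // Scaling by the text matrix alone gives a misleadingly large size.
        System.out.println(raw * Math.hypot(textMatrix[2], textMatrix[3])); // prints 512.0
        // Including the CTM gives the size a reader would actually see.
        System.out.println(displayFontSize(raw, textMatrix, ctm)); // prints 16.0
    }
}
```

With these made-up numbers, the raw size is 1, the size scaled by the text 
matrix alone is 512, and the size after the CTM is also applied is 16, which is 
the kind of gap I am describing.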

In my personal experience, I have also seen two independent researchers 
confused and frustrated by the fact that the most obvious way to get font size 
information, TextPosition.getFontSize(), seems to randomly return extremely 
incorrect results. 

So I would strongly advocate making accurate font size information a feature 
of PDFTextStripper, but if I have to, I can settle for the approach you 
suggest.

> Display font size
> -----------------
>
>                 Key: PDFBOX-3405
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3405
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Christopher Clark
>         Attachments: bad-font-p1.pdf
>
>
> I (along with others) have found using the font size of text to be very 
> useful when doing things like trying to recover the structure of PDFs. For 
> example, in heuristics like 'text with a large font size is probably a title'. 
> However, I noticed a few cases where getFontSizePt or getFontSize return 
> seemingly very inaccurate results. For example, in the attached pdf the 
> getFontSizePt for the title text is over 500.
> After digging into this a little, as I understand it, neither of these methods 
> returns a font size scaled to the display space. getFontSize returns the 
> "raw" encoded font size and getFontSizePt returns the font size scaled by the 
> text matrix, but not by the current transformation matrix. 
> Basically, in order to get reliable font information, it would be helpful if 
> either
> 1) getFontSizePt includes the effect of the current transformation matrix
> 2) A new method like "getDisplayFontSize" is added that returns the font 
> sizes scaled to the display space
> As a side note, I have seen several users (including myself), assume that 
> "getFontSize" returns the font size as would be observed when one opens the 
> PDF, and then been confused when these methods occasionally do not return the 
> results expected. I think "getFontSize" would benefit from a clear note that 
> the results might not include scaling factors that were used when the text 
> was rendered.


