[ 
https://issues.apache.org/jira/browse/PDFBOX-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson closed PDFBOX-3056.
-------------------------------
    Resolution: Won't Fix

I for one welcome such experimentation. PDFTextStripper is something of a 
legacy maze which no current members of the PDFBox project really understand. I 
certainly don't.

However, you should know that PDFTextStreamEngine exists only to provide 
PDFTextStripper with the incorrect calculations on which it depends. When I 
refactored our text handling in PDFStreamEngine for 2.0 I had no interest in 
also rewriting PDFTextStripper to take into account the fact that it depended 
on bogus upstream text calculations from PDFStreamEngine. So 
PDFTextStreamEngine was created, to supply those incorrect values until 
somebody else decides they want to fix PDFTextStripper.

So if you want to experiment with new approaches to text extraction (which 
could be _much_ simpler) then all you need is to subclass PDFStreamEngine. Its 
showGlyph method provides all you need, including a pre-calculated text rendering 
matrix (TRM). Look at PageDrawer for more information. We know that these 
values are correct, as we use them for rendering.
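
For readers unfamiliar with the TRM: the PDF specification defines it as the product of a small text-state matrix (font size, horizontal scaling, rise), the text matrix Tm, and the current transformation matrix CTM. The sketch below reproduces that product with java.awt.geom.AffineTransform only, to show what the pre-calculated value contains. It is not PDFBox code; the class name TrmSketch and its method names are hypothetical, and the sample numbers are made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical illustration class, not part of PDFBox.
final class TrmSketch {

    // Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM  (PDF row-vector convention).
    // horizontalScaling is Th, i.e. the Tz operand divided by 100.
    static AffineTransform textRenderingMatrix(double fontSize,
                                               double horizontalScaling,
                                               double rise,
                                               AffineTransform textMatrix,
                                               AffineTransform ctm) {
        // A PDF matrix [a b c d e f] maps directly onto AffineTransform(a, b, c, d, e, f).
        AffineTransform params = new AffineTransform(
                fontSize * horizontalScaling, 0, 0, fontSize, 0, rise);

        // concatenate() makes the argument apply first, so building from the CTM
        // inwards yields: params first, then Tm, then CTM.
        AffineTransform trm = new AffineTransform(ctm);
        trm.concatenate(textMatrix);
        trm.concatenate(params);
        return trm;
    }

    // Device-space position of the glyph origin (0, 0 in glyph space).
    static Point2D glyphOrigin(AffineTransform trm) {
        return trm.transform(new Point2D.Double(0, 0), null);
    }

    public static void main(String[] args) {
        // Assumed sample state: 12 pt font, 100% scaling, no rise,
        // text positioned at (72, 700), page scaled by 2.
        AffineTransform tm = AffineTransform.getTranslateInstance(72, 700);
        AffineTransform ctm = AffineTransform.getScaleInstance(2, 2);
        AffineTransform trm = textRenderingMatrix(12, 1.0, 0, tm, ctm);

        System.out.println("effective font size: " + trm.getScaleY());
        System.out.println("glyph origin: " + glyphOrigin(trm));
    }
}
```

The point of the sketch is that a single matrix already carries the glyph's position, effective size and orientation, so an extractor built on it should not need to re-derive any of these from the individual text-state operators.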

Take a look at 
[CustomGraphicsStreamEngine|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/rendering/CustomGraphicsStreamEngine.java]
 and 
[CustomPageDrawer|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/rendering/CustomPageDrawer.java]
 for inspiration. These are my new 2.0 APIs for power users and provide highly 
accurate glyph and text information. (I don't personally use PDFTextStripper.)

> Make PDFTextStreamEngine public
> -------------------------------
>
>                 Key: PDFBOX-3056
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3056
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>
> I'd like to experiment with writing my own text extractor that works better 
> for my use case than PDFTextStripper does. Hopefully I can port some of the 
> improvements to PDFTextStripper if I get something working well. I'd really 
> like PDFTextStreamEngine and its constructor to have increased visibility as 
> I believe that it's pretty reasonable for people to be able to write their own 
> PDFTextStrippers without having to reimplement showGlyph.



